Estimating Vocab Size of IRC Users

  • Thread starter mXSCNT
In summary, the speaker is trying to estimate the total vocabularies of people on IRC by using logs from the past few months and calculating the number of unique vocabulary words spoken and the total number of words spoken. However, simply recording the number of vocabulary words or dividing it by the total number of words would bias the results. To solve this problem, the speaker has created a function for each person that tracks the number of unique vocabulary words spoken by the time they reach a certain number of words. The goal is to fit a regression curve to this function in order to estimate the maximum vocabulary. The speaker is considering using a Poisson distribution to estimate the statistics for each function, but first needs to solve a combinatorial problem involving strings of letters over a finite alphabet.
  • #1
mXSCNT
For fun, I'm trying a project where I estimate the total vocabularies of people on IRC. I have logs from the past few months, and I can easily produce the number of unique vocabulary words spoken by any given person, as well as the total number of words they have spoken. For example, if someone had said "one two three four five six five one four", there would be six unique vocabulary words in that, and nine total words.

However, people speak different amounts. Simply recording the number of vocabulary words a person used would bias heavily in favor of people who talked more. Also, dividing the number of vocab words by the total number of words would bias heavily in favor of people who talked less.

So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.

My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the max vocab. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi=w) that the i'th word spoken by a person p is w. And one could assume that the Wi are independent. Then fp(n) could be estimated by the number of unique values of Wi, for i = 1,2,...,n. Where to go from here? I don't know.
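For concreteness, computing fp(n) itself from the logs is straightforward; here is a rough Python sketch (lowercasing the words is just one arbitrary choice of normalization):

[code]
# Rough sketch: f_p(n) = number of distinct words among person p's first n words.
def unique_words_curve(words):
    seen = set()
    curve = []
    for w in words:
        seen.add(w.lower())   # normalization is an arbitrary choice here
        curve.append(len(seen))
    return curve

sample = "one two three four five six five one four".split()
print(unique_words_curve(sample))  # [1, 2, 3, 4, 5, 6, 6, 6, 6] -> 6 unique out of 9 words
[/code]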
 
  • #2

fp(n) is a random variable. How about choosing an upper bound where the probability of exceeding it is about one in five billion? That is, on average we expect at most one person on Earth to break this upper bound. Now we can plot this error bar, and if it looks like it follows some kind of function, we can perform regression on it. First, though, find an estimate of this error bar for each f(n) up to some value N. This will give estimates of the maximum vocabulary, but not the total vocabulary. One problem I see: no matter how large you make N, who is to say they are going to use their entire vocabulary?

What do you think the statistics are like for each f(n)? Do you think they follow a Poisson distribution?
http://en.wikipedia.org/wiki/Poisson_distribution
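As a sketch of what that bound would look like under the Poisson assumption (the mean mu below is a made-up placeholder; it would have to be estimated from the data at each fixed n):

[code]
# Sketch only: upper bound on f(n) exceeded with probability ~1 in 5 billion,
# assuming f(n) is roughly Poisson with mean mu (mu is a placeholder value).
from scipy.stats import poisson

mu = 500.0
upper = poisson.ppf(1 - 1 / 5e9, mu)
print(upper)
[/code]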
 
  • #3
Let's solve the following problem first. Suppose that we have strings of n letters over an alphabet of size k. Let U(n,j,k) denote the number of such strings in which exactly j _different_ letters appear. U(3,2,3) = 18, for example, because the strings abb, bab, bba, acc, cac, cca, baa, aba, aab, bcc, cbc, ccb, caa, aca, aac, cbb, bcb, bbc are all the ways to form a string of length 3 over an alphabet of 3 letters in which exactly 2 different letters appear.

To form such a string with exactly j different letters, first choose which j letters appear in the string, for C(k, j) ways (j > 0, and C means choose). Then choose which position in the string gets which letter. This is the number of onto functions from the n positions in the string to the j chosen letters, which is j! S(n, j), where S(n, j) is a Stirling number of the second kind (S(n, j) counts the ways to partition the n positions into j nonempty blocks, and the j! assigns a distinct letter to each block). So,
[tex]U(n,j,k) = {k \choose j} j! \, S(n, j)[/tex]
Let V(n,k) be the random variable giving the number of unique letters in a string of length n over an alphabet of size k. If each letter in the string is equally likely (and statistically independent of the other letters), then
[tex]P(V(n,k)=j) = {k \choose j} \frac{j! \, S(n, j)}{k^n}[/tex].
So the expected number of unique letters is
[tex]E(V(n,k)) = \sum_j j {k \choose j} \frac{j! \, S(n,j)}{k^n}[/tex]
By linearity of expectation over the k letters (each letter appears at least once with probability 1 - (1 - 1/k)^n), this simplifies to E(V(n,k)) = k(1 - (1 - 1/k)^n).
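As a quick sanity check of these formulas, here is a rough Python sketch that recomputes the U(3,2,3) = 18 example and compares the expected value against a small simulation:

[code]
import math
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, j):
    # Stirling number of the second kind, via S(n,j) = j*S(n-1,j) + S(n-1,j-1)
    if n == j:
        return 1
    if j == 0 or j > n:
        return 0
    return j * stirling2(n - 1, j) + stirling2(n - 1, j - 1)

def U(n, j, k):
    # strings of length n over a k-letter alphabet using exactly j distinct letters
    return math.comb(k, j) * math.factorial(j) * stirling2(n, j)

def expected_unique(n, k):
    # E(V(n,k)) from the exact distribution P(V=j) = U(n,j,k) / k**n
    return sum(j * U(n, j, k) for j in range(1, k + 1)) / k ** n

print(U(3, 2, 3))               # 18, matching the enumeration above
print(expected_unique(20, 10))  # ~8.78, same as 10*(1 - (1 - 1/10)**20)
trials = 10000
sim = sum(len({random.randrange(10) for _ in range(20)}) for _ in range(trials)) / trials
print(sim)                      # simulation; should be close to the exact value
[/code]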

Now, back to the problem of estimating a person's vocabulary. It's clear that the probability of a person speaking the word "the" and the probability of the same person speaking the word "zygomatic" are not the same. But it might be an acceptable simplifying assumption to suppose that, after the first thousand or so most common words, all the remaining words the person knows have equal probability of being said. (If we assume that the person has a finite vocabulary, we can't give their probability of saying a word an exponential distribution or any other common distribution, since those have infinite support.) Let C be the probability that a spoken word is among the 1000 most common, and let R be the number of words spoken before all of the 1000 most common have appeared. The probability that a word is not among the 1000 most common is 1-C, so about (1-C)n of the first n words are uncommon. So then, for n > R we can approximate:
[tex]f_p(n) = V((1-C)n,v_p-1000) + 1000[/tex].
C and R can be empirically calculated, if we have enough data. So all that remains is to find the value of v_p that best fits the function for sufficiently large n, and hope that all the approximations work out.
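A rough sketch of that last fitting step in Python (it substitutes the expectation k(1 - (1 - 1/k)^m) for V(m, k), and the values of C and the observed curve below are made-up placeholders standing in for real log data):

[code]
import numpy as np

def model_curve(n, v_p, C, common=1000):
    # expected f_p(n) once the ~1000 common words have all appeared (n > R)
    k = v_p - common
    m = (1 - C) * n
    return common + k * (1 - (1 - 1 / k) ** m)

def fit_vocab(ns, f_obs, C, candidates):
    # pick the candidate v_p minimizing squared error against the observed curve
    errors = [np.sum((model_curve(ns, v, C) - f_obs) ** 2) for v in candidates]
    return candidates[int(np.argmin(errors))]

# placeholder data standing in for a real f_p(n) sampled at a few values of n
ns = np.array([2000.0, 5000.0, 10000.0, 20000.0])
f_obs = np.array([1400.0, 1900.0, 2500.0, 3200.0])
print(fit_vocab(ns, f_obs, C=0.7, candidates=range(3000, 20001, 100)))
[/code]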
 
  • #4
In order to use that, though, I'm going to need some kind of approximation to P(V(n,k)=j). I can use Stirling's approximation for the binomial coefficient, but what can I use to approximate the Stirling number of the second kind?
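One thought: for a numerical answer the Stirling numbers can be sidestepped entirely, since the number of distinct letters seen so far is a Markov chain. A rough sketch:

[code]
# Sketch: compute P(V(n,k) = j) directly, without Stirling numbers, by tracking
# the distribution of "distinct letters seen so far" one letter at a time.
def distinct_letter_distribution(n, k):
    p = [0.0] * (k + 1)
    p[0] = 1.0
    for _ in range(n):
        new = [0.0] * (k + 1)
        for j in range(k + 1):
            if p[j] == 0.0:
                continue
            new[j] += p[j] * (j / k)                # repeated an already-seen letter
            if j < k:
                new[j + 1] += p[j] * ((k - j) / k)  # saw a new letter
        p = new
    return p

print(distinct_letter_distribution(3, 3)[2])  # 18/27 = 0.666..., matching U(3,2,3)/3**3
[/code]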
 
  • #5
mXSCNT said:
So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.

My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the max vocab. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi=w) that the i'th word spoken by a person p is w. And one could assume that the Wi are independent. Then fp(n) could be estimated by the number of unique values of Wi, for i = 1,2,...,n. Where to go from here? I don't know.

Do poor spellers have an unfair advantage?

Anyway, how about the following approach using maximum likelihood. Suppose each person uses the words in their vocabulary with the same relative frequency as the overall population, i.e.
[tex]P_p(W_i=w)=\frac{P(W_i=w)}{S_p}[/tex]
where P(Wi=w) is estimated from all words ever used and Sp (unknown) is the sum of P(Wi=w) over all words in the person's vocabulary. Similarly, let Sp(n) be the sum over the vocabulary observed so far.

Now at each step the probability that fp(k+1)-fp(k)=1 (i.e. a new word is used) is
[tex]1-S_p(k)/S_p[/tex]

From this we get a likelihood function based on the sequence of observed values of fp(k), and maximize it (or its log) to estimate Sp.

Since we don't know which words make up the remaining mass Sp-Sp(n) out of those not already used, take the most common words not observed so far as the most likely candidates and count how many of them are needed for their probabilities to sum past that value. That count gives an estimate of (vp-fp(n)), the number of words in the vocabulary that haven't been used so far.
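A rough sketch of that likelihood in Python (the global_freq table and the word sequence below are made-up placeholders, and the maximization is just a crude grid search rather than anything clever):

[code]
import math

def neg_log_likelihood(S_p, words, global_freq):
    # -log L(S_p): at step k+1 a new word appears with probability 1 - S_p(k)/S_p
    nll = 0.0
    seen = set()
    S_pk = 0.0  # sum of global_freq over person p's words seen so far
    for w in words:
        p_new = 1.0 - S_pk / S_p
        if w in seen:
            nll -= math.log(1.0 - p_new)
        else:
            nll -= math.log(p_new)
            seen.add(w)
            S_pk += global_freq[w]
    return nll

def estimate_S_p(words, global_freq, steps=1000):
    # crude grid search over S_p, constrained to be at least the observed mass
    observed = sum(global_freq[w] for w in set(words))
    grid = [observed + (1.0 - observed) * i / steps for i in range(1, steps + 1)]
    return min(grid, key=lambda s: neg_log_likelihood(s, words, global_freq))

# placeholder data: global word frequencies and one person's word sequence
global_freq = {"the": 0.05, "cat": 0.01, "sat": 0.008, "zygomatic": 1e-6}
words = ["the", "cat", "the", "sat", "the", "cat"]
print(estimate_S_p(words, global_freq))
[/code]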

Hope this helps. I'd be very interested to see the results :)
 
