mXSCNT
For fun, I'm trying a project where I estimate the total vocabularies of people on IRC. I have logs from the past few months, and I can easily produce the number of unique vocabulary words spoken by any given person, as well as the total number of words they have spoken. For example, if someone said "one two three four five six five one four", that would be six unique vocabulary words and nine total words.
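Concretely, the counts come straight from the token list. A minimal sketch, assuming the log for one person has already been tokenized into lowercased words (the function name is my own placeholder):

    def count_words(tokens):
        """Return (unique_words, total_words) for one person's list of word tokens."""
        return len(set(tokens)), len(tokens)

    tokens = "one two three four five six five one four".split()
    print(count_words(tokens))  # (6, 9)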
However, people speak different amounts. Simply recording the number of vocabulary words a person used would bias heavily in favor of people who talked more. Also, dividing the number of vocab words by the total number of words would bias heavily in favor of people who talked less.
So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.
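In code, fp(n) is just the running count of distinct words as the person's word stream is consumed; a sketch under the same tokenization assumption as above:

    def vocab_growth(tokens):
        """f_p(n) for n = 1..len(tokens): distinct words among the first n tokens."""
        seen = set()
        growth = []
        for w in tokens:
            seen.add(w)
            growth.append(len(seen))
        return growth

    tokens = "one two three four five six five one four".split()
    print(vocab_growth(tokens))  # [1, 2, 3, 4, 5, 6, 6, 6, 6]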
My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the maximum vocabulary. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi = w) that the i-th word spoken by person p is w, and to assume that the Wi are independent. Then fp(n) would just be the number of unique values among W1, ..., Wn. Where to go from here, I don't know.
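To make the fitting step concrete, here is a rough sketch of what I have in mind, using an arbitrary saturating form vp * (1 - exp(-n/tau)) purely as a placeholder (I have no reason to believe this is the right curve) and scipy.optimize.curve_fit; the fitted asymptote would play the role of vp:

    import numpy as np
    from scipy.optimize import curve_fit

    def saturating(n, v, tau):
        # Placeholder functional form (an assumption, not a derived model):
        # rises from 0 and levels off at the asymptote v.
        return v * (1.0 - np.exp(-n / tau))

    # f_p(n) for one person, e.g. the output of the vocab_growth sketch above.
    growth = np.array([1, 2, 3, 4, 5, 6, 6, 6, 6], dtype=float)
    n = np.arange(1, len(growth) + 1, dtype=float)

    (v_hat, tau_hat), _ = curve_fit(saturating, n, growth, p0=[growth.max(), 5.0])
    print(v_hat, tau_hat)  # v_hat is the estimated vocabulary size under this assumed form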