- #1

mXSCNT


However, people speak different amounts. Simply counting the vocabulary words a person used would be heavily biased in favor of people who talked more. Dividing that count by the total number of words spoken would instead be heavily biased in favor of people who talked less, since new words turn up more and more rarely the longer someone talks.
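To make the two biases concrete, here is a minimal sketch. It assumes a hypothetical speaker with a fixed 500-word vocabulary who picks every word uniformly at random; the names `naive_scores`, `vocab`, and `speech` are made up for illustration and are not real data.

```python
import random

def naive_scores(words):
    """Return (unique word count, unique count / total words) for a word list."""
    unique = len(set(words))
    return unique, unique / len(words)

# Hypothetical speaker (an assumption, not real data):
# 500-word vocabulary, every word equally likely, words drawn independently.
rng = random.Random(0)
vocab = [f"w{i}" for i in range(500)]
speech = [rng.choice(vocab) for _ in range(5000)]

# The same speaker, scored after 200 words and after 5000 words.
short_count, short_ratio = naive_scores(speech[:200])
long_count, long_ratio = naive_scores(speech)

print("200 words: ", short_count, round(short_ratio, 3))
print("5000 words:", long_count, round(long_ratio, 3))
```

The vocabulary is identical in both cases, yet the raw count is larger for the long sample and the ratio is larger for the short one, so neither number is comparable across people who spoke different amounts.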

So for each person $p$, I have a function $f_p(n)$, which is the number of unique vocabulary words a person has spoken by the time they reach $n$ words. Assuming that people have a limited vocabulary, $f_p(n)$ should be bounded from above by a constant $v_p$ for each person; $v_p$ is the vocabulary size that I wish to estimate.

My problem is now to fit a regression curve to $f_p(n)$ for each person, in order to estimate the max vocab. Any ideas about the form of this curve? One possibility is to start with the probability $P_p(W_i = w)$ that the $i$'th word spoken by a person $p$ is $w$. And one could assume that the $W_i$ are independent. Then $f_p(n)$ could be estimated by the number of unique values of $W_i$, for $i = 1, 2, \ldots, n$. Where to go from here? I don't know.
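
One possible next step, offered only as a sketch: if, in addition to independence, every $W_i$ is assumed to have the same distribution (an extra assumption not made above), with $q_w$ the probability of word $w$, then the chance that $w$ has appeared at least once in $n$ words is $1 - (1 - q_w)^n$, and so $E[f_p(n)] = \sum_w \left(1 - (1 - q_w)^n\right)$. This sum increases with $n$ and approaches the number of words with $q_w > 0$, which plays the role of $v_p$. The function names below are hypothetical.

```python
def unique_curve(words):
    """Empirical f_p(n) for n = 1..len(words): running count of distinct words."""
    seen = set()
    curve = []
    for w in words:
        seen.add(w)
        curve.append(len(seen))
    return curve

def expected_unique(probs, n):
    """E[f_p(n)] assuming the W_i are independent and identically distributed.

    probs holds the per-word probabilities q_w (assumed to sum to 1).
    """
    return sum(1.0 - (1.0 - q) ** n for q in probs)

# Empirical curve for a toy sentence.
curve = unique_curve("the cat saw the dog and the cat ran".split())
print(curve)  # [1, 2, 3, 3, 4, 5, 5, 5, 6]

# Model curve for a hypothetical two-word vocabulary, each word with probability 0.5.
print(expected_unique([0.5, 0.5], 2))  # 1.5
```

The empirical curve is what a regression would be fitted to. The model curve shows the kind of saturating shape the independence assumption implies, although its exact form depends on the unknown $q_w$.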