For fun, I'm trying a project where I estimate the total vocabularies of people on IRC. I have logs from the past few months, and I can easily produce the number of unique vocabulary words spoken by any given person, as well as the total number of words they have spoken. For example, if someone had said "one two three four five six five one four", there would be six unique vocabulary words in that, and nine total words.
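The counting step can be sketched in a few lines. This uses a naive whitespace tokenizer; a real IRC log would need nicks, timestamps, and punctuation stripped first.

```python
def word_counts(text):
    """Return (total words, unique words) for a chunk of text,
    using a simple lowercase whitespace split as the tokenizer."""
    words = text.lower().split()
    return len(words), len(set(words))

total, unique = word_counts("one two three four five six five one four")
# total == 9, unique == 6, matching the example above
```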

However, people speak different amounts. Simply recording the number of vocabulary words a person used would bias heavily in favor of people who talked more. Also, dividing the number of vocab words by the total number of words would bias heavily in favor of people who talked less.

So for each person p, I have a function f_p(n), which is the number of unique vocabulary words that person has spoken by the time they reach n total words. Assuming that people have a limited vocabulary, f_p(n) should be bounded from above by a constant v_p for each person; v_p is the vocabulary size that I wish to estimate.
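Computing f_p(n) from a word sequence is a running count of distinct words seen so far, which is what the curve fit would take as input. A minimal sketch (the function name is mine):

```python
def unique_by_n(words):
    """Return f(n) for n = 1..len(words): the number of distinct
    words among the first n words spoken."""
    seen = set()
    f = []
    for w in words:
        seen.add(w)
        f.append(len(seen))
    return f

f = unique_by_n("one two three four five six five one four".split())
# f == [1, 2, 3, 4, 5, 6, 6, 6, 6] -- nondecreasing, and bounded
# above by the speaker's vocabulary size v
```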

My problem is now to fit a regression curve to f_p(n) for each person, in order to estimate the maximum vocabulary. Any ideas about the form of this curve? One possibility is to start with the probability P_p(W_i = w) that the i'th word spoken by person p is w, and to assume that the W_i are independent. Then f_p(n) would be the number of unique values among W_1, W_2, ..., W_n. Where to go from here, I don't know.
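One step further under that independence assumption: the expected value of f_p(n) has a closed form, E[f_p(n)] = Σ_w (1 − (1 − P_p(W = w))^n), since each word w is absent from the first n draws with probability (1 − P_p(W = w))^n. In the (unrealistic but illustrative) special case where all v words are equally likely, this reduces to v·(1 − (1 − 1/v)^n), which already has the right shape: roughly linear growth at first, saturating at v. A quick sanity check by simulation; function names are mine:

```python
import random

def expected_unique(n, v):
    """E[f(n)] when each word is drawn i.i.d. uniformly from a
    vocabulary of size v: the uniform special case of
    E[f(n)] = sum over w of (1 - (1 - P(w))**n)."""
    return v * (1.0 - (1.0 - 1.0 / v) ** n)

def simulate_unique(n, v, trials=2000, seed=0):
    """Monte Carlo estimate of E[f(n)] under the same uniform model."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(v) for _ in range(n)})
    return total / trials

# The simulated mean should track the closed form closely,
# e.g. for v=50, n=100: expected_unique(100, 50) is about 43.4.
```

A curve of this family (with v as a free parameter) could then be fit to the observed f_p(n) to estimate v_p, though real word frequencies are far from uniform, so a Zipf-like distribution would likely fit better.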

**Physics Forums | Science Articles, Homework Help, Discussion**


# Estimating vocabulary
