Probability that a Random String is a Word

WWGD · Dec 4, 2018

Hi,
Say L is a human language (e.g. German, Chinese, etc.) and w is a string in L of length n>1. Is it known for different languages what the probability is that w is a word in L? And if S is an ordered set of strings, the probability that S is grammatically correct in L? I mean, I know or have a good idea how to answer this _if_ I had access to the right database. If being the key word here. Mayb this has to see with entropy?
Thanks.

RPinPA · Dec 4, 2018

I suppose you could estimate that from estimates of the number of words in the language.

For instance, the Oxford English Dictionary website gives the following estimates: "The Second Edition of the 20-volume Oxford English Dictionary, published in 1989, contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries" but also cautions that it's hard to define what counts as a word. So let's use their total of 171476 + 47156 + 9500 = 228132 words.

There are 26 letters in English. So there are ##26^n## strings of length ##n##, and ##\sum_{k=1}^{n} 26^k = \frac {26}{25} (26^n - 1)## strings of lengths 1 to ##n##.

Now, in the OED's count, while there are a few words > 25 letters, those are pretty rare. We'd probably be more interested in a lower cutoff. So what's the distribution of word length in English?

I found this chart which you may find helpful for your question. It has a distribution of word lengths for many languages.

According to that chart, 0.2% of words in English are 19 characters; 0.1% are 20 characters. If we take ##n=20## as the cutoff, then there are ##2.07 \times 10^{28}## strings of lengths 1 to 20, and the probability that one of those is the 228132 / ##2.07 \times 10^{28} = 1.1 \times 10^{-23}##

The probability will go up as you consider shorter strings. If I look at that chart and take his "average" of 8.23 as the median, so roughly half the words are of length 8 or less, then there are only ##2.17 \times 10^{11}## random strings up to length 8, and the probability that one of them is a word is about 114000/##2.17 \times 10^{11} = 6.6 \times 10^{-7}##, or a little over 2 in a million. And you can do similar calculations for other languages.

Now you might go on to ask what the probability is of a long string containing a shorter word somewhere within it.

Buzz Bloom · Dec 4, 2018

RPinPA said:

According to that chart, 0.2% of words in English are 19 characters; 0.1% are 20 characters. If we take n=20n=20 as the cutoff, then there are 2.07×10282.07 \times 10^{28} strings of lengths 1 to 20,

Hi RP:

While your approach gives a reasonable feel for answering the OP's question, I think there is something not quite correct in the way you did the calculation.
Given that there are (E) 228132 English words, and that the number (n) of different strings of lengths 1 to 20 is 2.07×10²⁸, I think n/N is not the correct way to answer the OP's question. I think you need to take into account the actual number of words of length k,

k=1 to 20.

I will try to explain the using an exaggerated example.

Suppose the 228132 words had the following (bizarre) distribution: assume all combinations of k letters were actual words for lengths k =1,2,3, and, with respect to the combination of 4 letters, the remainder of the 228132 words were all of length 4. What you would now have are:

26 words of length 1
676 words of length 2
17,576 words of length 3,
209,854 words of length 4, and

no words of length longer than 4.
Also:

the probability of a random string of length 1 being a word is 100%,
the probability of a random string of length 2 being a word is 100%,
the probability of a random string of length 3 being a word is 100%, and
the probability of a random string of length 4 being a word is 46%.

Finally, the probability of a random word of length <5 being a word is 48%.

BTW, I clicked on your "this chart" link and failed to find any chart.

Regards,
Buzz

Klystron · Dec 4, 2018

Reached Ravi's website using the "chart link" but not the data.

Can we consider post #3 as refining the initial cut by RP in #2? Simple rules can eliminate many of the non-language strings such as consecutive repetitions of identical symbols past 3 repetitions, without knowing the language.

I've worked with language databases with functions that provided statistical inference of which letter or letter groups follow a partial input string. Obvious example from modern English: letter "u" likely follows letter "q". I can research current sources and references if needed. Input string auto-completion is ubiquitous in browser software, search engines, etc.

RPinPA · Dec 4, 2018

Not sure what happened to the link, but here's the correct one. It's a nice interactive chart.

http://www.ravi.io/language-word-lengths

My reasoning was very simple-minded: if all random strings from length 1 to n are equally probable, then clearly the probability of getting a word is (# of words) / (# of possible strings).

Buzz Bloom said:

Finally, the probability of a random word of length <5 being a word is 48%.

Sure. But the vast majority of the random strings, those from 5-20 characters in length, are not words in this language. Therefore the probability of a random string being a word is infinitesimal, about half of the probability that the random string is less than 5 characters.

To be precise, the number of strings of length 1 to 4 is 475254. So the probability that your random string is going to be one of those is 475254 / ##2.07 \times 10^{28} = 2.29 \times 10^{-23}##. And only half of those are words.

The flaw is my assumption that all strings are equally probable, and I think that's what you're getting at. We'd need to know more about how the strings are generated "randomly". A better model, and again I think this is implicitly what you're getting at is this:

Let W = the event that the string is a word.
Let N = the length of the string.
Then ##P(W) = P(W|N=1) P(N = 1) + P(W|N=2) P(N = 2) + ... + P(W|N=n) P(N = n)##

In your example language all of the probabilities ##P(N = k)## are 0 for ##k \geq 5##. But the probabilities ##P(N = 1)## through ##P(N = 4)## are extremely small. But an actual random-generation procedure, especially one that checks as it goes to see if it's formed a word yet, may have a strong bias toward short strings.

Buzz Bloom · Dec 4, 2018

RPinPA said:

if all random strings from length 1 to n are equally probable

Hi RP:

If you will do the chore of posting the list of pairs

[k = number letters, p(k) = percent of words with k letters)]

from the chart for English in a copy-able form, I will do the chore of calculating the corresponding weighted probability that a random sequence of letters will be a word.

Regards,
Buzz

Klystron · Dec 4, 2018

Agree with chart author that the values need to be weighted by word frequency in each L. Good work Ravi et al.

RPinPA · Dec 5, 2018

RPinPA said:

In your example language all of the probabilities ##P(N = k)## are 0 for ##k \geq 5##.

That was a misstatement. In Buzz Bloom's example, it is the probabilities ##P(W | N = k)## which are 0 for ##k \geq 5##. The probabilities ##P(W | N = k)## are large for ##k \lt 5##, 0.46 to 1.00. But my point is that ##P(N = k)## is on the order of ##10^{-23}## for ##k < 5##, making the total probability ##P(W)## extremely small if all strings of length 1 to 20 are equally probable.

Buzz Bloom said:

If you will do the chore of posting the list of pairs
[k = number letters, p(k) = percent of words with k letters)]

OK, here you go. I did this manually but I'm pretty sure I got it right.
k p(k) (%)
1 0.0
2 0.1
3 0.6
4 2.6
5 5.2
6 8.5
7 12.2
8 14.0
9 14.0
10 12.6
11 10.1
12 7.5
13 5.2
14 3.2
15 2.0
16 1.0
17 0.6
18 0.3
19 0.2
20 0.1
21 0.1
22 0.0

Buzz Bloom · Dec 6, 2018

RPinPA said:

OK, here you go. I did this manually but I'm pretty sure I got it right.

Hi RP:

I did a spot check of your numbers for the values for

f(k) = the probability a randomly selected word will have k letters,

and I found no errors. However, the chart has a value of "0" for f(1). Technically this is correct, but it gives the wrong answer to the following question.

What is the probability p(k) of randomly picking a single letter (k=1) which will be a word of length one?

There are two words of length 1, ("a" and "i") making

p(1) = 7.7%

rather than p(1) = 0.

Except for k=1, I calculated

p(k) = f(k) × N / 2²⁶,

where N = 228132, the assumed number of English words. I obtained the following values:

p(2) = 33.7%
p(3) = 7.8%
p(4) = 1.3%
p(k) [k = 5...22] = 0.0%.

The average value A = Avg( p(k) [k=1...22] ) = 2.3%.

This method of calculating A assumes one of the many possible ways to randomly choose a string of letters.

First roll a 20-sided Icosahedron die with the numbers 1...20 to select a value of k in [1...20]. Then randomly choose a random sequence of k letters.

I am aware that this is a completely arbitrary way of randomly choosing a string of letters. However, the method described in post #2 is also arbitrary. The point I am trying to make is that any method of choosing a random string is arbitrary. There is no single "correct" answer to the question,

What is the mathematically correct way to chose a random string of letters?

This reminds me of a practical problem my wife needed to solve regarding choosing a method to doing accounting to calculate the asset value of our house. The brilliant answer she came up with is:

When choosing a method of accounting, choose the one that makes you happy.

The only mathematical principle that applies to the random choice of a letter sequence is that two separate steps are needed.

1. Randomly chose a value k for the number of letters.
2. Randomly choose a string of k letters.

Step 2 is mathematically easy. Step 1 is arbitrary. As an example, I throw into the mix one more (arbitrary) way to do step 1.

Choose a random word from a dictionary. (Choosing a dictionary is also arbitrary.) Choose as the value of k the length of the chosen word.

Regards,
Buzz

RPinPA · Dec 6, 2018

Buzz Bloom said:

However, the chart has a value of "0" for f(1).

Yes, since there are only two 1-letter words in English and 220000 words total, that is less than 0.05% so it rounds down to 0.

Buzz Bloom said:

The point i am trying to make is that any method of choosing a random string is arbitrary. There is no single "correct" answer to the question,
What is the mathematically correct way to chose a random string of letters?

Agreed. Let's assign all the strings of length 1-20 a natural number ##n## up to the total count ##N##. In other words, we'll put them in a list. The 1-letter strings are numbered 1-26, the two-lettered strings are numbered 27-(26 + 26^2), etc. When we construct a "random string" selecting procedure such as your 20-sided die, it is a lot more likely that an intuitive procedure will come up with a non-uniform distribution biased toward short strings than a uniform one on {1, 2, ..., N}

Buzz Bloom said:

The only mathematical principle that applies to the random choice of a letter sequence is that two separate steps are needed.
1. Randomly chose a value k for the number of letters.
2. Randomly choose a string of k letters.

But you don't even have to make it a two-step procedure. Assuming you have a way of drawing from a uniform distribution from 1 to ##N##, you could just use it.

And here's yet another:
1. Choose a letter a-z for position 1.
2. For all subsequent positions, choose randomly from {a, b, ..., z, <space>}.
3. If you chose a space, terminate the string. If the string is 20 characters long, terminate the string.
4. Otherwise go back to step 2.

Now the length will follow a geometric distribution.

And then there are procedures you could cook up that use language knowledge. For instance, generate the string letter by letter but have a rule like "stop if you have a word" (which is going to leave out a lot of words which embed smaller words at the start). Or "stop if you know already there's no matching word", for instance the string began ZG. That will eliminate most of the long nonsense words and probably increase the probability significantly of finding a word randomly.

Anyway, you're right that there is no one way to "choose a random string".

RPinPA · Dec 6, 2018

I just want to say congratulations to the OP on a really, really interesting question that has opened up many other interesting questions. Sorry we couldn't actually answer your question :-) But in fact if you want to come up with a specific model for how you're generating random strings, we can help analyze it.

BTW I wrote Ravi asking him about his data. His response was "The data was from some word list / dictionaries I found for each language, but unfortunately I built this ~6 years ago and no longer have access to the data or a recollection of where I found it."

Buzz Bloom · Dec 6, 2018

RPinPA said:

When we construct a "random string" selecting procedure such as your 20-sided die, it is a lot more likely that an intuitive procedure will come up with a non-uniform distribution biased toward short strings than a uniform one on {1, 2, ..., N}

Hi RP:

Perhaps I should have noted that all of the arbitrary methods for selecting random strings are biased in some respect. The problem I have with choosing a value N and then choosing a random string of length <= N is that N is arbitrary. You chose an arbitrary value of N=20. Take a look at

https://en.wikipedia.org/wiki/Pneumonoultramicroscopicsilicovolcanoconiosis .

This shows that using an arbitrary choice of dictionary, and then choosing N as the length of this dictionary's largest word, can create a very an extremely smaller final probability value than your arbitrary choice of N=20. (If I counted correctly the Oxford dictionary's largest word gives the value of N=45.)

Your suggestion to use a geometric distribution I think is also a "fair" method. But your method arbitrarily chooses the random value of k in terms of powers of (1/27). I think that this is much less biased in favor of small values of k than the flat distribution of values of k (=1...22) I used in my Post#10 example. Perhaps a "fairer" choice would be a geometric distribution using powers of (1/2).

I think that a very reasonable argument can be made that the last method I described in post #10

Buzz Bloom said:

Choose a random word from a dictionary. (Choosing a dictionary is also arbitrary.) Choose as the value of k the length of the chosen word.

is the "most fair" method.

Regards,
Buzz

Klystron · Dec 6, 2018

Fascinating thread. Thanks for the exercise

.

Some additional notes on the process of recognizing words (W) in randomly generated character strings.

n = 1 character words: the OP (wisely IMO) restricted n > 1. So, investigating 'string function' performance at n = 1 is useful but outside original requirements. [But see below concerning pictograms.]

If we have access to a dictionary for each language containing exhaustive examples of all words in each language, we can also assume a reasonable random number generator rand () that returns a psuedo-random number between 0 and 1. Also, dictionaries contain the meaning of words -- a knowledge base. Simpler character and string recognition tables for each L should suffice.

Thread failed to develope S "an ordered set of W's". If we apply meaning to each w in L, how should we apply meaning to each accepted S?

Likewise, each primary human language L has enormous influence on written word length. Consider the OP's original examples Chinese and German. Chinese language can be written in pictographs as well as letters. Is word character length even applicable compared to Frisian derived languages such as German and English?
Would converting letter to character to symbol help resolve the "character count" discrepancy among languages?

Great thread. Got to go (and I haven't even mentioned random variables W and S !).

WWGD · Dec 7, 2018

Thank you all for your answers. To explain the background, a group of friends and I went to an Italian place and we thought it would be fun if we made out random dish names see if we "hit" /guessed a real dish in the process, e.g.: I want a puscavine with fertocci in a muscavana sauce and some pacevale . Of course, for wine I always order some Caumanure;). We then completely forgot to do any computations after we were asked to leave;).Of course, we could make simplifying assumptions on having at least one vowel ( since we were not at a Polish restaurant after all ;)).

WWGD · Dec 10, 2018

Sorry for the weird story. I will see if we can find similarities in the distribution between related languages. Maybe these are language(family) invariants.

RPinPA · Dec 11, 2018

WWGD said:

Sorry for the weird story. I will see if we can find similarities in the distribution between related languages. Maybe these are language(family) invariants.

It's a fun story, and it adds a lot of information to the calculation. Your particular random-string generation algorithm is a much more limited population than most of us were thinking of. You're stringing together plausible Italian syllables, and I would guess you're almost always using three or four syllables. Very different from the letter-by-letter string generation I was thinking of, and much more likely to produce an actual word.

Estimating how likely would require a different sort of database, one that let you count syllables.

Klystron · Dec 11, 2018

Random phonemes provide more information and ability to identify patterns than character strings. The word database now includes representative sound files, physical measured waveforms. Gives the ability to identify the spoken language based on acoustics given enough "true" words.

More information increases complexity and ambiguity. Acoustics doesn't discriminate phonemes for English words/letters "C", "c", "sea", "see", Latin "si". Speech consists of more than "strings of phonemes". Interval (between sounds), accent, adjacent and dependent phonemes enhance meaning reducing ambiguity. The S database in L allow these measurements. And grammars.

pbuk · Dec 14, 2018

I'm surprised no-one has given the correct answer to the original question, which is trivial:

There are an infinite number of strings of length n > 1 and only a finite number of words in any human language L. The probability that a string is a word in L is therefore 0. Similarly, the probability that each member of an ordered set of strings is a word is 0 and as any ordered set containing a non-word cannot be a grammatically well-formed sentence of L, this probablility is also 0.

References:
https://web.archive.org/web/2013012...et/experiments/notes/publication/NOTES_EN.pdf

RPinPA · Dec 14, 2018

Not quite trivial. What probability distribution are you drawing from? There's no such thing as a uniform distribution on all natural numbers.

WWGD · Dec 14, 2018

pbuk said:

I'm surprised no-one has given the correct answer to the original question, which is trivial:

There are an infinite number of strings of length n > 1 and only a finite number of words in any human language L. The probability that a string is a word in L is therefore 0. Similarly, the probability that each member of an ordered set of strings is a word is 0 and as any ordered set containing a non-word cannot be a grammatically well-formed sentence of L, this probablility is also 0.

References:
https://web.archive.org/web/2013012...et/experiments/notes/publication/NOTES_EN.pdf

But you are also not considering the question as I asked it. I am trying to filter by string length. And I would obviously not consider , e.g., strings of length 15+ as potential words ( except possibly in German).

pbuk · Dec 15, 2018

RPinPA said:

Not quite trivial. What probability distribution are you drawing from? There's no such thing as a uniform distribution on all natural numbers.

Yes this is true, what I should have said is the the frequency of occurrence of words among all strings is 0. The probability of selecting a word does of course depend on the distibution of selection of strings - for instance if we choose a distribution such that P(length = 1) = 0.5 then the probability of selecting a word is at least 1/13 because of 'I' and 'a'.

The same applies to the frequency of occurrence of well-formed sentences.

WWGD said:

But you are also not considering the question as I asked it. I am trying to filter by string length. And I would obviously not consider , e.g., strings of length 15+ as potential words ( except possibly in German).

Ah yes, my mistake, you did say ' length n > 1'.

Probability that a Random String is a Word

Graduate Expected numbers of cards of a last color remaining

Graduate Probability puzzle

Undergrad The problem of points

Undergrad The countability paradox of computable numbers

Undergrad How does axiom of foundation prevent infinite sequence of elements?

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Probability that a Random String is a Word

Similar threads