In the field of language computation there are two sides: "parsing" and "generating". You are talking about generating natural language, with a set of rules to control how phrasing is done, etc.
Let's use some other systems as an example.
If you think about Google Translate... What type of data does Google actually store? I'll keep this to one language.
- Characters - All of the letters used in the language (somewhere near 26+13 characters, if you think about it), each with its own registered ID (maybe a TINYINT [1 byte] on MySQL, plus a VARCHAR in UTF8 format for the character itself) and a language ID (another TINYINT - 1 byte)
- Words - All the words found in that language (each with its own registered ID as well), sized for the largest word (it could literally be a 28-character word, 1 byte for each character). You also have to take into account a possible ID for word type, and maybe some other supporting information about the word, like what tense it is in and what POV it uses.
- n-Grams - All of the sequences of up to n characters (n could be 10 or even 20) that are seen in the training data. With this setup, you need to store the character's ID in each position... so for 20 characters, that's 20 bytes of character IDs, plus a normal-sized INT for the ID of the n-gram itself. One could argue at this point that n-grams "are" words, but they are more like statistical data about the most likely combinations of characters.
- Word n-Grams - All of the encountered sequences of up to n words, creating a sort of phrase-statistics system. This lets the system know what words should come next. Word n-grams work a lot like "autocomplete" on a phone: they statistically predict which word should follow another. To see more about n-grams, see Google's Ngram Viewer. These are 100% necessary if you are going to process or generate natural language with your program. Also, as mentioned earlier, there are multiple dialects for each language, including slang. n-Grams sort of let you "skip" creating rules: if you have large text samples to train on, the system picks up whatever language/dialect/slang appears in that sample data. You're still limited to a finite corpus, though, while a language has literally unlimited combinations of words, since words can be created, shortened, etc. (there's a short sketch of how these tables fit together after this list).
- n-Gram level Synonyms, Antonyms, Relationships - This is where you start relating words to phrases and recording how they relate to each other. You might even include some information about which dialect each n-gram is used in as part of this relationship.
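To make the table layout above concrete, here is a minimal sketch in Python, with plain dicts standing in for the MySQL tables; all of the field names and helper functions are illustrative, not taken from any real system:

```python
# Minimal sketch: dicts standing in for the character, word, and
# word n-gram tables described above. Field names are illustrative.

characters = {}   # char  -> (char_id, language_id)
words = {}        # word  -> {"id", "type", "tense"}
word_ngrams = {}  # tuple of word IDs -> ngram_id

def register_char(ch, language_id=1):
    characters.setdefault(ch, (len(characters) + 1, language_id))

def register_word(word, word_type=None, tense=None):
    """Give a word an ID plus supporting info; also feed the character table."""
    entry = words.setdefault(word, {"id": len(words) + 1,
                                    "type": word_type, "tense": tense})
    for ch in word:
        register_char(ch)
    return entry["id"]

def register_word_ngram(word_sequence):
    """Store a sequence of up to n words as a tuple of word IDs."""
    key = tuple(register_word(w) for w in word_sequence)
    return word_ngrams.setdefault(key, len(word_ngrams) + 1)

register_word("went", word_type="verb", tense="past")
register_word_ngram(["went", "upstairs"])
```

In a real MySQL setup each of those dicts would be a table with the ID columns described above; the dicts just make the shape of the data easy to see.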
So imagine that you have 10,000 words in your "vocabulary" set (word table), as mentioned in a previous reply. Let's say you have a "large enough" sample data set: your sample data includes 100,000 sentences, with an average word count per sentence of 8. That is 800,000 words of sample data. If you were to record n-grams only up to 5 words (not counting character-based n-grams), you would end up with something like this for a sentence... jack went upstairs to get his coat (7 words)...
7 1-Grams: jack, went, upstairs, to, get, his, coat
6 2-Grams: jack went, went upstairs, upstairs to, to get, get his, his coat
5 3-Grams: jack went upstairs, went upstairs to, upstairs to get, to get his, get his coat
4 4-Grams: jack went upstairs to, went upstairs to get, upstairs to get his, to get his coat
3 5-Grams: jack went upstairs to get, went upstairs to get his, upstairs to get his coat
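If you want to sanity-check those counts, a sliding window over the 7-word sentence produces them directly; this little Python snippet (same example sentence, nothing assumed beyond that) prints every n-gram and its count:

```python
# Enumerate the word 1-grams through 5-grams for the example sentence.

sentence = "jack went upstairs to get his coat".split()

for n in range(1, 6):
    grams = [" ".join(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]
    print(f"{len(grams)} {n}-Grams:", ", ".join(grams))

# 7 + 6 + 5 + 4 + 3 = 25 n-grams in total for this one sentence.
```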
Each of these n-grams takes up roughly the same amount of space: five word-ID slots at 4 bytes each (a MySQL INT) plus a byte of supporting data, about 21 bytes per row. So this one sentence generates 7 + 6 + 5 + 4 + 3 = 25 n-grams, or roughly 525 bytes of raw n-gram data before any re-use, and the 100,000-sentence corpus multiplies that into tens of megabytes.
Now, the re-use of some of these n-grams is very consistent... for instance, "went to" would probably show up a lot in a training corpus, and that saves us some data, since we only have to store that n-gram once plus a small reference for each sentence it appears in. Instead of a 4-byte INT per word ID, we might also use a smaller type (a 2-byte SMALLINT easily covers a 10,000-word vocabulary) and compress the size of the training data considerably. With that, the total size of stored n-grams and their links for an 800,000-word training corpus might land somewhere between 10MB and 30MB. This is assuming that your corpus covers most of the most-used sentence patterns in said language.
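For a rough feel of where those numbers come from, here is a back-of-envelope calculation; the row layout (five fixed ID slots plus one supporting byte) and the "no re-use" starting point are just illustrative assumptions, not a measurement of any real system:

```python
# Back-of-envelope storage estimate for the corpus described above.
# All row sizes are assumptions for illustration only.

sentences = 100_000
avg_words = 8
max_n = 5

# Sliding-window count of n-grams in an average 8-word sentence: 8+7+6+5+4 = 30
ngrams_per_sentence = sum(avg_words - n + 1 for n in range(1, max_n + 1))

def raw_megabytes(id_bytes):
    row_bytes = max_n * id_bytes + 1   # five word-ID slots + 1 supporting byte
    return sentences * ngrams_per_sentence * row_bytes / 1_000_000

print(raw_megabytes(4))   # 4-byte INT IDs      -> ~63 MB before any re-use
print(raw_megabytes(2))   # 2-byte SMALLINT IDs -> ~33 MB before any re-use
```

Once the heavily repeated n-grams are de-duplicated and only sentence references are stored for them, it is plausible that the figure lands in the 10MB to 30MB range mentioned above.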
Over the years, many have done similar things. One good reference is the Stanford NLP library. If I recall, their CoreNLP system (not including many of their other tools for processing language) was somewhere around 150MB... and to actually generate language from what it gives you, you'd still have to use it in a way specific to generating language rather than just parsing it.
I'd estimate that, depending on your purposes, you could get away with storing English, along with a statistical set of reduced rules, within 100MB, and that is an optimistic estimate. In the end, it depends on what you're using it for, as well as how accurate you want it to be.