How many bytes are there in a language?

Hi everyone

Apologies if this is the wrong place to ask, but I was wondering how much data there was in a language (let's say English).

So for example, if you were to program a computer to speak English fluently, how much data would all the vocabulary and rules of syntax take up?


Thanks
 

jim mcnamara

Mentor
3,356
1,574
This is an ad hoc answer - you can look up the size of a vocabulary for many languages. But if you read below, you may see that this is not enough.

A functional vocabulary in English is about 10,000 words or fewer. People with specialized advanced science degrees, for example, may use 100,000 words, most of which you or I would not be expected to know.

A functional vocabulary in Navajo is close to 100,000 words. Plus, most "words" are actually small sentences with a noun and predicates, which means you can have sentences within sentences, a kind of recursion. You could know all of the base words and still not be able to talk usefully to anyone. A literal translation of "mouse" from Navajo to English is 'it scrabbles' or 'it makes scratching sounds'.

So, when you read a 10-page Navajo story and then read the same story in English, the English version is usually about 2/3 as long, say, 6 to 7 pages.

So my answer to you is that the question can only be answered sensibly in very limited contexts. You ask lots of questions first, then go with an estimate. Example: Is it Medical English - huge - or newspaper English - small?

There are also creole or pidgin languages that vary a lot from one locale or country to another, which complicates things further. Tiwa (an American Indian language) also borrows a lot from other languages for use only in certain contexts. These are sometimes called regionalisms - English has them too. Soda, coke, pop are all words that take on special meanings in local areas. "Pop" a balloon, or drink a carbonated beverage, soda pop, or coke, or pop, or just plain soda. This gets bad enough in British versus American English that you can actually use the wrong word and cause a problem for yourself. An example is the word "fanny". In the Americas it means your rear end. It means something else much less polite in Britain.
 
10,098
3,639
Another aspect is that many languages, such as Greek, Russian, or Arabic, do not use the standard A-through-Z alphabet. Some, like Chinese, don't use an alphabet at all. Some have multiple ways of writing a word, such as Japanese with katakana, hiragana, and kanji.

Each of these writing systems is represented in Unicode and stored using UTF-8 or other multibyte encodings, which will affect your storage requirements. You can look up the details online.
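To make the storage point concrete, here is a minimal Python sketch that counts the bytes one word takes under a few Unicode encodings. The sample words and the helper name `encoded_sizes` are illustrative assumptions, not something taken from the posts above.

```python
# Illustrative sample words for "mouse"; the word choices are assumptions.
samples = {
    "English": "mouse",
    "Greek": "ποντίκι",
    "Russian": "мышь",
    "Japanese (kanji)": "鼠",
    "Japanese (hiragana)": "ねずみ",
}


def encoded_sizes(text):
    """Return the byte length of `text` under three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


for language, word in samples.items():
    # len(word) counts code points; the dict holds byte counts per encoding.
    print(language, word, len(word), encoded_sizes(word))
```

Latin letters take one byte each in UTF-8, Cyrillic and Greek letters take two, and most Chinese and Japanese characters take three, so the same vocabulary can need noticeably different amounts of storage depending on the script and the encoding chosen.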

Next we come to the word itself. In Latin, for example, words carry additional information in their endings. A verb has an ending that indicates the tense (past, present, future...) and who did the action (I, you, he/she/it, we, you plural, or they), and there are many, many exceptions. These endings are part of the rules for creating meaningful sentences.

All this variability in each language is what makes it so difficult to do proper translation without the help of big data.

One recent attempt to shrink it down is the Ili translator device. It does this by working one way only, i.e. you learn the sentences it has stored. You speak them into the device and it speaks out the translated sentence that was prerecorded in its memory. The sentences are carefully selected so that the only answers are yes, no or some numeric value. Great for travelers, not so great for conversational settings.
 
32,257
8,231
Storing full information about a language (even if we ignore regional variations) involves much more than just storing word lists. What is the meaning of these words? When are they used? We do not even have a way to uniquely write this down in a computer-readable (and therefore quantifiable) way.
 

lewando

Homework Helper
Gold Member
1,313
118
...if you were to program a computer to speak English fluently, how much data would all the vocabulary and rules of syntax take up?
That is a very broad question and the answer depends on what you are really trying to do. A random sentence generator with a limited vocabulary will be tiny. The same generator using the Oxford English Dictionary will be about the size of the dictionary. If you are making a chatbot, does the chatbot have to generate syntactically correct random responses or pass a Turing Test? If you need voice generation, what kind of quality (Stephen Hawking or HAL9000)?
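As a rough illustration of how small the "random sentence generator with a limited vocabulary" case can be, here is a toy Python sketch. The vocabulary, the fixed sentence template, and the function names are all hypothetical; a real generator would need far richer grammar.

```python
import random

# Hypothetical toy vocabulary, grouped by grammatical slot.
VOCAB = {
    "det": ["the", "a"],
    "adj": ["blue", "small", "quiet"],
    "noun": ["mouse", "file", "robot"],
    "verb": ["reads", "finds", "follows"],
}

# Hypothetical fixed template: determiner, adjective, noun, verb, determiner, noun.
TEMPLATE = ["det", "adj", "noun", "verb", "det", "noun"]


def random_sentence(rng):
    """Fill each slot of the template with a random word from the vocabulary."""
    words = [rng.choice(VOCAB[slot]) for slot in TEMPLATE]
    return " ".join(words).capitalize() + "."


def data_size_bytes():
    """Total UTF-8 size of every word in the vocabulary, in bytes."""
    return sum(len(w.encode("utf-8")) for words in VOCAB.values() for w in words)


rng = random.Random(0)  # seeded so repeated runs give the same sentence
print(random_sentence(rng))
print(data_size_bytes(), "bytes of vocabulary")
```

The output is syntactically plausible but meaningless, which is the distinction the post draws: the data for this kind of generator is only a few dozen bytes, while anything that has to make sense needs far more than a word list.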
 
1,109
505
So for example, if you were to program a computer to speak English fluently, how much data would all the vocabulary and rules of syntax take up?
Just the vocabulary and the rules together won't make a language. A language also includes context, meaning and such.

To increase the difficulty of the problem: did you take into consideration the recursive aspect of the whole thing? You can describe a language only by using a language.

I'm afraid that to get a usable answer you first have to put some limits on this question.
 
I hadn't even thought of voice generation. I suppose I would want HAL9000.

The question came to me while wishing I could download proficiency in a language Matrix-style. I'd be happy with a 10,000 word vocabulary without specialised vocab.

I hadn't considered things such as the Turing test. I guess it'd be learned as a second language, so the robot would already be able to think and communicate in a native language.
 

lewando

Homework Helper
Gold Member
1,313
118
A couple of responses:
  • To your original question concerning memory requirements: Why? I would think that memory would be the least of your concerns at this stage. Amazon Echo, for example, is a small, finite-memory device. It connects to a relatively unlimited-memory cloud-based server to do all the heavy lifting.
  • Voice synthesis (text-to-speech) has evolved nicely and will continue to do so. Recently, however, I hear a lot of robotic "voice overs" used in radio advertising and YouTube narration that are ultimately very annoying. Soulless, perhaps. Anyway, Microsoft has a nice API: https://developer.microsoft.com/en-us/microsoft-edge/testdrive/demos/speechsynthesis/
  • Turning this into a translation problem is a good approach. Having a robot on hand that thinks and communicates is a huge time saver. Google, a company at the forefront of language translation, has a lot of open publications in this area: https://research.google.com/pubs/MachineTranslation.html
 
Cheers. Thanks
 

256bits

Gold Member
2,671
683
I remember years back, when I was a kid, there was a 30-second "Moments in History" (or something like that) spot put out by the government (probably not) or a related aid agency, describing a language whose name I do not recall.
The lady's name (phonetically) was Lotta Hitch Manova, and she explained that one word meant 3 things depending upon the verbal intonation.
Phonetically again, somewhat, I will try.
Anikte, Antiktee, Aniktey meant your wife, his wife, my wife.

I guess I wasn't too far off
https://en.wikipedia.org/wiki/Lotta_Hitschmanova

The point is that saying the same group of words with different intonations can convey anger, surprise, sarcasm, disbelief, ...
That would add an extra xxx bytes of program and data for it to be "fluent" in a language and comprehend the actual meaning.
 

Svein

Science Advisor
Insights Author
1,912
571
Anecdote (I have lots of them).

Way back in 1975 my research lab bought a microprocessor development system from Siemens, with manuals in both German and English. My favorite language difference was in the explanation of Error 13.
  • In German: "Die angezeichnete Datei ist in Diskettenbereich nicht eingeführt"
  • In English: "No such file"
Disclaimer: I am not German, and I may not have remembered the exact German sentence. The English sentence is exact.
 

.Scott

Homework Helper
2,164
681
Apologies if this is the wrong place to ask, but I was wondering how much data there was in a language (let's say English).

So for example, if you were to program a computer to speak English fluently, how much data would all the vocabulary and rules of syntax take up?
Let me take your question literally - if nothing else, it will be a lesson in stating requirements.
You want the program to speak (but not listen to) English. I will presume that the information is being provided as text and the program simply needs to read the text aloud as a fluent English speaker would.

So, there's an app for that (actually, scores of them):
https://play.google.com/store/apps/details?id=com.google.android.tts&hl=en

I am sure that many of those apps exceed all of my estimates, but here is my take:

So it will need a dictionary that includes the correct spelling, pronunciation, part-of-speech indicators (verb/noun/adj/etc.), and usage information for every word in its vocabulary (say 12,000 words). It will also need some rules for sentence and phrase construction. And it will need a voice database - even if it is to speak in only one voice.

Perhaps the largest part of this database will be information about how to pronounce the right variations of words and to impose the correct inflections on common phrases.

Still, it is not a large amount of data. Perhaps a couple of megabytes.
Depending on how elaborate the GUI is, the software engine itself will likely be another couple of megabytes.
Of course, that's version 1.0. Once completed, these types of programs tend to grow.
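
The dictionary part of that estimate can be checked with a back-of-envelope calculation. In the sketch below, only the 12,000-word vocabulary comes from the post; every per-entry byte count is an assumed round number chosen purely for illustration, and the function name is hypothetical.

```python
# Vocabulary size taken from the post above.
VOCABULARY_SIZE = 12_000

# Assumed average bytes per dictionary entry (illustrative guesses only).
BYTES_PER_ENTRY = {
    "spelling": 10,        # assumed average word length plus a separator
    "pronunciation": 12,   # assumed phoneme string
    "part_of_speech": 1,   # assumed one-byte tag (verb/noun/adj/...)
    "usage_notes": 40,     # assumed short usage hint
}


def dictionary_bytes(words=VOCABULARY_SIZE, per_entry=BYTES_PER_ENTRY):
    """Estimate raw dictionary size as words times total bytes per entry."""
    return words * sum(per_entry.values())


total = dictionary_bytes()
print(f"{total} bytes, about {total / 1_000_000:.2f} MB")
```

Under these assumptions the raw dictionary comes in below one megabyte, which leaves room within "a couple of megabytes" for the phrase rules and the pronunciation and inflection data that the post expects to dominate.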
 

.Scott

Homework Helper
2,164
681
Anecdote (I have lots of them).

Way back in 1975 my research lab bought a microprocessor development system from Siemens, with manuals in both German and English. My favorite language difference was in the explanation of Error 13.
  • In German: "Die angezeichnete Datei ist in Diskettenbereich nicht eingeführt"
  • In English: "No such file"
Disclaimer: I am not German, and I may not have remembered the exact German sentence. The English sentence is exact.
About that same time, perhaps a year or two before then, I was working on the ComputerVision CADDS 3 system (Computer Assisted Design and Drafting). I developed a "language" for inserting points, lines, arcs, etc into the drawing - a combination of short mnemonics (INS, DEL, POINT, LIN, ARC, ...) and digitizations (think mouse clicks).
Once I completed it, a multilingual colleague created a German version with mnemonics based on German words. Surprisingly, the Germans preferred the English version.
 

jack action

Science Advisor
Insights Author
Gold Member
1,794
854
A lot of the answers in this thread consider storing words and using grammar rules to assemble them. But I wonder how many sentences you would have to know to be considered «speaking fluently»?

That is what we do with the example "No such file": we don't store 3 words and assemble them, we store the sentence.

For example, no one ever uses the sentence "The blue strawberries fly over the spaghetti sauce.", so there would be no need to store it in memory.
 

Vanadium 50

Staff Emeritus
Science Advisor
Education Advisor
22,378
4,707
I don't think this is a well-defined question. What does it mean to "speak English fluently" if the computer doesn't know what to say?
 
32,257
8,231
  • In German: "Die angezeichnete Datei ist in Diskettenbereich nicht eingeführt"
  • In English: "No such file"
The German version tells you much more. Literally it is not very meaningful, but a sentence like "Die Diskette mit der gesuchten Datei ist nicht eingeführt" would be "The disk with the file searched for is not inserted". Now both sentences have about the same length and information content.
"No such file" can be translated 1:1 as "Keine solche Datei".
 

symbolipoint

Homework Helper
Education Advisor
Gold Member
5,279
804
I hadn't even thought of voice generation. I suppose I would want HAL9000.

The question came to me while wishing I could download proficiency in a language Matrix-style. I'd be happy with a 10,000 word vocabulary without specialised vocab.

I hadn't considered things such as the Turing test. I guess it'd be learned as a second language, so the robot would already be able to think and communicate in a native language.
The thing about any real, HUMAN language is that a person (I'm not even sure about a machine) needs to learn it inductively. Also, real human languages evolve: the words and phrases used today will be expressed differently 500-700 years from now.
 

ChrisVer

Gold Member
3,330
434
92
28
Anecdote (I have lots of them).

Way back in 1975 my research lab bought a microprocessor development system from Siemens, with manuals in both German and English. My favorite language difference was in the explanation of Error 13.
  • In German: "Die angezeichnete Datei ist in Diskettenbereich nicht eingeführt"
  • In English: "No such file"
Disclaimer: I am not German, and I may not have remembered the exact German sentence. The English sentence is exact.
The scribed data is in floppy disk area not found
 

Svein

Science Advisor
Insights Author
1,912
571
"No such file" can be translated 1:1 as "Keine solche Datei".
Yes, but no German would ever think of using such a grammatically incorrect expression.
 
