How many bytes are there in a language?

  1. Jan 17, 2018 #1
    Hi everyone

    Apologies if this is the wrong place to ask, but I was wondering how much data there was in a language (let's say English).

    So for example, if you were to program a computer to speak English fluently, how much data would all the vocabulary and rules of syntax take up?

  3. Jan 17, 2018 #2

    jim mcnamara

    Staff: Mentor

    This is an ad hoc answer - you can look up the size of a vocabulary for many languages. But if you read below you may see that is not enough.

    A functional vocabulary in English is about 10,000 words or fewer. People with specialized advanced science degrees, for example, may use 100,000 words, most of which you or I would not be expected to know.

    A functional vocabulary in Navajo is close to 100,000 words. Plus, most "words" are actually small sentences with a noun and predicates, which means you can have sentences within sentences, a kind of recursion. You could know all of the base words and still not be able to talk usefully to anyone. A literal translation of "mouse" from Navajo to English is "it scrabbles" or "it makes scratching sounds".

    So, when you read a 10-page Navajo story and then read the same story in English, the English version is usually about 2/3 as long, say, 6 or 7 pages.

    So my answer to you is that the question can only be answered sensibly in very limited contexts. You ask lots of questions first, then go with an estimate. Example: Is it Medical English - huge - or newspaper English - small?

    There are also creole or pidgin languages that vary a lot from one locale or country to another, which complicates things further. Tiwa (an American Indian language) also borrows a lot from other languages for use only in certain contexts. These are sometimes called regionalisms, and English has them too. Soda, coke, and pop are all words that take on special meanings in local areas: you can "pop" a balloon, or drink a carbonated beverage called soda pop, coke, pop, or just plain soda. This gets bad enough between British and American English that you can actually use the wrong word and cause a problem for yourself. An example is the word "fanny". In the Americas it means your rear end. In Britain it means something much less polite.
  4. Jan 17, 2018 #3


    Staff: Mentor

    Another aspect is that many languages, such as Greek, Russian, or Arabic, do not use the standard A-to-Z Latin alphabet. Some, like Chinese, don’t use an alphabet at all. Some have multiple means of writing a word, such as Japanese with katakana, hiragana, and kanji.

    Each of these writing systems is covered by Unicode, and text in them is stored using UTF-8 or another multibyte encoding, which will affect your storage requirements. You can look up the details online.
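
    To make the storage point concrete, here is a minimal sketch that counts how many bytes a short word occupies in UTF-8 and UTF-16. The sample words and the helper name encoded_sizes are chosen purely for illustration.

```python
# Sample words are illustrative only; any text in these scripts behaves the same way.
samples = {
    "English": "language",
    "Russian": "язык",
    "Chinese": "语言",
    "Japanese (hiragana)": "ことば",
}

def encoded_sizes(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for a string."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name, word in samples.items():
    chars, utf8, utf16 = encoded_sizes(word)
    print(f"{name:20} {word}: {chars} chars, {utf8} bytes UTF-8, {utf16} bytes UTF-16")
```

    In UTF-8, unaccented Latin letters take one byte each, Cyrillic letters two, and Chinese characters or Japanese kana three, so the same number of characters can cost very different amounts of storage.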

    Next we go to the word. In Latin, for example, words carry additional information in their endings. A verb's ending indicates the tense (past, present, future...) and who did the action (I, you, he/she/it, we, you plural, or they), and there are many, many exceptions. These endings are part of the rules for creating meaningful sentences.

    All this variability in each language is what makes it so difficult to do proper translation without the help of big data.

    One recent attempt to shrink it down is the Ili translator device. It does this by working one way, i.e., you learn the sentences it has stored. You speak them into the device and it speaks out the translated sentence that was prerecorded in its memory. The sentences are carefully selected so that the only answers are yes, no, or some numeric value. Great for travelers, not so great for conversational settings.
  5. Jan 17, 2018 #4


    2017 Award

    Staff: Mentor

    Storing full information about a language (even if we ignore regional variations) is much more than just storing word lists. What is the meaning of these words? When are they used? We do not even have a way to uniquely write this down in a computer-readable (and therefore quantifiable) way.
  6. Jan 17, 2018 #5


    Homework Helper
    Gold Member

    That is a very broad question and the answer depends on what you are really trying to do. A random sentence generator with a limited vocabulary will be tiny. The same generator using the Oxford English Dictionary will be about the size of the dictionary. If you are making a chatbot, does the chatbot have to generate syntactically correct random responses or pass a Turing Test? If you need voice generation, what kind of quality (Stephen Hawking or HAL9000)?
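
    As a rough illustration of how tiny a limited-vocabulary random sentence generator can be, here is a minimal sketch. The vocabulary, the single sentence pattern, and the function names are all invented for this example.

```python
import random

# Hypothetical toy vocabulary and grammar; both are invented for illustration.
VOCAB = {
    "DET": ["the", "a"],
    "ADJ": ["blue", "small", "quiet"],
    "NOUN": ["mouse", "strawberry", "computer"],
    "VERB": ["sees", "eats", "follows"],
}
PATTERN = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]

def random_sentence(rng):
    """Fill each slot of PATTERN with a random word of the right category."""
    words = [rng.choice(VOCAB[slot]) for slot in PATTERN]
    return " ".join(words).capitalize() + "."

def data_size_bytes():
    """Bytes needed to store the vocabulary and the pattern as UTF-8 text."""
    vocab_bytes = sum(len(w.encode("utf-8")) for ws in VOCAB.values() for w in ws)
    pattern_bytes = sum(len(slot.encode("utf-8")) for slot in PATTERN)
    return vocab_bytes + pattern_bytes

rng = random.Random(0)  # fixed seed so repeated runs give the same sentence
print(random_sentence(rng))
print(data_size_bytes(), "bytes of language data")
```

    The sentences are syntactically correct but meaningless, and the whole "language" fits in well under a hundred bytes. Swapping in a full dictionary grows the data to roughly the size of that dictionary.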
  7. Jan 17, 2018 #6
    Just the vocabulary and the rules together won't make a language. That also includes context, meaning and such.

    To increase the difficulty of the problem: did you take into consideration the recursive aspect of the whole thing? You can describe a language only by using a language.

    I'm afraid that to get a usable answer you have to set some limits on this question first.
  8. Jan 17, 2018 #7
    I hadn't even thought of voice generation. I suppose I would want HAL9000.

    The question came to me while wishing I could download proficiency in a language Matrix-style. I'd be happy with a 10,000 word vocabulary without specialised vocab.

    I hadn't considered things such as the Turing test. I guess it'd be learned as a second language, so the robot would already be able to think and communicate in a native language.
  9. Jan 18, 2018 #8


    Homework Helper
    Gold Member

    A couple of responses:
    • To your original question concerning memory requirements: Why? I would think that memory would be the least of your concerns at this stage. Amazon Echo, for example, is a small, finite-memory device. It connects to a relatively unlimited-memory cloud-based server to do all the heavy lifting.
    • Voice synthesis (text-to-speech) has evolved nicely and will continue to do so. Recently, however, I hear a lot of robotic voice-overs in radio advertising and YouTube narration that are ultimately very annoying. Soulless, perhaps. Anyway, Microsoft has a nice API: https://developer.microsoft.com/en-us/microsoft-edge/testdrive/demos/speechsynthesis/
    • Turning this into a translation problem is a good approach. Having a robot on hand that thinks and communicates is a huge time saver. Google, a company at the forefront of language translation, has a lot of open publications in this area: https://research.google.com/pubs/MachineTranslation.html
  10. Jan 19, 2018 #9
    Cheers. Thanks
  11. Jan 19, 2018 #10


    Gold Member

    I remember years back, when I was a kid, there was a 30-second "Moments in History" spot (or something like that), put out by the government (probably not) or a related aid agency, describing a language I do not recall.
    The lady's name (phonetically) was Lotta Hitch Manova, and she explained that one word meant three things depending upon the verbal intonation.
    Phonetically again, somewhat, I will try.
    Anikte, Antiktee, Aniktey meant your wife, his wife, my wife.

    I guess I wasn't too far off

    The point is that saying the same group of words with different intonations can designate anger, surprise, sarcasm, disbelief, ...
    That would add an extra xxx-bytes to a program and data for it to be "fluent" in a language, and comprehend an actual meaning.
  12. Jan 19, 2018 #11


    Science Advisor

    Anecdote (I have lots of them).

    Way back in 1975 my research lab bought a microprocessor development system from Siemens, with manuals in both German and English. My favorite language difference was in the explanation of Error 13.
    • In German: "Die angezeichnete Datei ist in Diskettenbereich nicht eingeführt"
    • In English: "No such file"
    Disclaimer: I am not German, and I may not have remembered the exact German sentence. The English sentence is exact.
  13. Jan 19, 2018 #12
    Let me take your question literally - if nothing else, it will be a lesson in stating requirements.
    You want the program to speak (but not listen to) English. I will presume that the information is being provided as text and the program simply needs to read the text aloud as a fluent English speaker would.

    So, there's an app for that (actually, scores of them):

    I am sure that many of those apps exceed all of my estimates, but here is my take:

    So it will need a dictionary that includes the correct spelling, pronunciation, part-of-speech indicators (verb/noun/adj/etc.), and usage information for every word in its vocabulary (say 12,000 words). It will also need some rules for sentence and phrase construction. And it will need a voice database, even if it is to speak in only one voice.

    Perhaps the largest part of this database will be information about how to pronounce the right variations of words and to impose the correct inflections on common phrases.

    Still, it is not a large amount of data. Perhaps a couple of megabytes.
    Depending on how elaborate the GUI is, the software engine itself will likely be another couple of megabytes.
    Of course, that's version 1.0. Once completed, these types of programs tend to grow.
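
    A back-of-envelope version of that dictionary estimate can be sketched as follows. Only the 12,000-word vocabulary comes from the post. Every per-entry size is an assumed round number, not a measurement, and the voice database is left out entirely.

```python
# All per-entry sizes below are assumed round numbers, not measurements.
VOCAB_SIZE = 12_000          # word count suggested in the post
BYTES_PER_ENTRY = {
    "spelling": 8,           # assumed average word length in bytes
    "pronunciation": 10,     # assumed size of a phoneme string
    "part_of_speech": 1,     # verb/noun/adj flag
    "usage_notes": 40,       # assumed size of short usage hints
}

def dictionary_bytes(vocab_size, bytes_per_entry):
    """Total bytes for a flat dictionary with fixed average entry sizes."""
    return vocab_size * sum(bytes_per_entry.values())

total = dictionary_bytes(VOCAB_SIZE, BYTES_PER_ENTRY)
print(f"{total} bytes = {total / 1_000_000:.2f} MB")
```

    Under these assumptions the text dictionary comes to 708,000 bytes, under one megabyte, which leaves the rest of a couple-of-megabytes budget for construction rules and voice data.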
  14. Jan 19, 2018 #13
    About that same time, perhaps a year or two before then, I was working on the ComputerVision CADDS 3 system (Computer Assisted Design and Drafting). I developed a "language" for inserting points, lines, arcs, etc into the drawing - a combination of short mnemonics (INS, DEL, POINT, LIN, ARC, ...) and digitizations (think mouse clicks).
    Once I completed it, a multilingual colleague created a German version with mnemonics based on German words. Surprisingly, the Germans preferred the English version.
  15. Jan 19, 2018 #14

    jack action

    Science Advisor
    Gold Member

    A lot of the answers in this thread consider storing words and using grammar rules to assemble them. But I wonder how many sentences you would have to know to be considered «speaking fluently»?

    That is what we do with the example "No such file", we don't store 3 words and assemble them, we store the sentence.

    For example, no one ever uses the sentence "The blue strawberries fly over the spaghetti sauce.", so there would be no need to store it in memory.
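
    To put rough numbers on the words-versus-sentences trade-off, here is a minimal sketch. The 10,000-word vocabulary echoes figures used earlier in the thread. The one-million-sentence phrasebook, the eight-word sentence length, and the function names are assumptions for illustration only.

```python
import math

def bits_per_word(vocab_size):
    """Whole bits needed to index one word in a fixed vocabulary."""
    return math.ceil(math.log2(vocab_size))

def possible_sequences(vocab_size, length):
    """Number of word sequences of a given length, sensible or not."""
    return vocab_size ** length

def phrasebook_bytes(num_sentences, words_per_sentence, vocab_size):
    """Bytes to store sentences as packed lists of word indices."""
    bits = num_sentences * words_per_sentence * bits_per_word(vocab_size)
    return math.ceil(bits / 8)

VOCAB = 10_000          # vocabulary size used earlier in the thread
LENGTH = 8              # assumed sentence length, in words
STORED = 1_000_000      # assumed number of sentences actually worth storing

print(bits_per_word(VOCAB), "bits per word index")
print(possible_sequences(VOCAB, LENGTH), "possible 8-word sequences")
print(phrasebook_bytes(STORED, LENGTH, VOCAB), "bytes for the stored phrasebook")
```

    Almost all of the 10^32 possible eight-word sequences are nonsense of the blue-strawberries kind, while a million stored sentences fit in about 14 MB under these assumptions. The hard part is deciding which sentences belong in the phrasebook.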
  16. Jan 19, 2018 #15

    Vanadium 50

    Staff Emeritus
    Science Advisor
    Education Advisor
    2017 Award

    I don't think this is a well-defined question. What does it mean to "speak English fluently" if the computer doesn't know what to say?
  17. Jan 19, 2018 #16


    2017 Award

    Staff: Mentor

    The German version tells you much more. Literally it is not very meaningful, but a sentence like "Die Diskette mit der gesuchten Datei ist nicht eingeführt" would be "The disk with the file searched for is not inserted". Now both sentences have about the same length and information content.
    "No such file" can be translated 1:1 as "Keine solche Datei".
  18. Jan 19, 2018 #17


    Homework Helper
    Education Advisor
    Gold Member

    The thing about any real, HUMAN language is that a person (I'm not even sure about a machine) needs to learn it inductively. Also, real human languages evolve, in that today's word meanings and phrases will be handled differently 500-700 years from now.
  19. Jan 20, 2018 #18


    Gold Member

    A lot... the more you give the computer, the better it will learn... The better it learns, the better its results can be.
    For example, SyntaxNet is able to achieve a performance of ~94% for well-formatted text or ~90% for text from the web.
  20. Jan 23, 2018 #19
    The scribed data is in floppy disk area not found
  21. Jan 23, 2018 #20


    Science Advisor

    Yes, but no German would ever think of using such a grammatically incorrect expression.