In the field of language computation there are two sides: "parsing" and "generating". You are talking about generating natural language, with a set of rules to control how phrasing is done, etc.
Let's use some other systems as an example.
If you think about Google Translate... What type of data does Google actually store? I'll keep this to one language.
- Characters - All of the letters used in the language (somewhere near 26+13 characters, if you think about it), each with its own registered ID (maybe a TINYINT [1 byte] on MySQL, plus a VARCHAR in UTF8 format for the character itself) and a language ID (another TINYINT - 1 byte)
- Words - All the words found in that language (each with its own registered ID as well), sized for the largest word (it could literally be a 28-character word, 1 byte for each character). You also have to take into account a possible ID for word type, and maybe some other supporting information about the word, like what tense it is in and what POV it uses.
- n-Grams - All of the sequences of up to n characters (n could be 10 or even 20) that are seen in the training data. With this setup, you need to store the character's ID in each position... so for 20 characters, that's 20 bytes of character IDs, plus a normal-sized INT for the ID of the n-gram itself. One could argue at this point that n-grams "are" words, but they are more like statistical data about the most likely combinations of characters.
- Word n-Grams - All of the encountered sequences of up to n words, creating a sort of phrase-statistics system. This lets the system know what words should come next. Word n-grams work a lot like "autocomplete" on a phone: they statistically predict which word should follow another. To see more about n-grams, see Google's Ngram Viewer. These are 100% necessary if you are going to process or generate natural language with your program. Also, as mentioned earlier, there are multiple dialects for each language, including slang. n-Grams sort of let you "skip" creating rules: if you have large text samples to train on, the system picks up whatever language/dialect/slang appears in that sample data. You're still limited to a finite corpus, though, while a language has literally unlimited combinations of words, since words can be created, shortened, etc. (there's a short sketch of how these tables fit together after this list).
- n-Gram level Synonyms, Antonyms, Relationships - This is where you start relating words to phrases and recording how they relate to each other. You might even include some information about which dialect each n-gram is used in as part of this relationship.
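To make the table layout above concrete, here is a minimal sketch in Python, with plain dicts standing in for the MySQL tables; all of the field names and helper functions are illustrative, not taken from any real system:

```python
# Minimal sketch: dicts standing in for the character, word, and
# word n-gram tables described above. Field names are illustrative.

characters = {}   # char  -> (char_id, language_id)
words = {}        # word  -> {"id", "type", "tense"}
word_ngrams = {}  # tuple of word IDs -> ngram_id

def register_char(ch, language_id=1):
    characters.setdefault(ch, (len(characters) + 1, language_id))

def register_word(word, word_type=None, tense=None):
    """Give a word an ID plus supporting info; also feed the character table."""
    entry = words.setdefault(word, {"id": len(words) + 1,
                                    "type": word_type, "tense": tense})
    for ch in word:
        register_char(ch)
    return entry["id"]

def register_word_ngram(word_sequence):
    """Store a sequence of up to n words as a tuple of word IDs."""
    key = tuple(register_word(w) for w in word_sequence)
    return word_ngrams.setdefault(key, len(word_ngrams) + 1)

register_word("went", word_type="verb", tense="past")
register_word_ngram(["went", "upstairs"])
```

In a real MySQL setup each of those dicts would be a table with the ID columns described above; the dicts just make the shape of the data easy to see.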
So imagine that you have 10,000 words in your "vocabulary" set (word table), as mentioned in a previous reply. Let's say you have a "large enough" sample data set: your sample data includes 100,000 sentences, with an average word count per sentence of 8. That is 800,000 words of sample data. If you were to record n-grams only up to 5 words (not counting character-based n-grams), you would end up with something like this for a sentence... jack went upstairs to get his coat (7 words)...
7 1-Grams: jack, went, upstairs, to, get, his, coat
6 2-Grams: jack went, went upstairs, upstairs to, to get, get his, his coat
5 3-Grams: jack went upstairs, went upstairs to, upstairs to get, to get his, get his coat
4 4-Grams: jack went upstairs to, went upstairs to get, upstairs to get his, to get his coat
3 5-Grams: jack went upstairs to get, went upstairs to get his, upstairs to get his coat
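If you want to sanity-check those counts, a sliding window over the 7-word sentence produces them directly; this little Python snippet (same example sentence, nothing assumed beyond that) prints every n-gram and its count:

```python
# Enumerate the word 1-grams through 5-grams for the example sentence.

sentence = "jack went upstairs to get his coat".split()

for n in range(1, 6):
    grams = [" ".join(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]
    print(f"{len(grams)} {n}-Grams:", ", ".join(grams))

# 7 + 6 + 5 + 4 + 3 = 25 n-grams in total for this one sentence.
```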
Each of these n-grams takes up roughly the same amount of space: five word-ID slots at 4 bytes each (a MySQL INT) plus a byte of supporting data, about 21 bytes per row. So this one sentence generates 7 + 6 + 5 + 4 + 3 = 25 n-grams, or roughly 525 bytes of raw n-gram data before any re-use, and the 100,000-sentence corpus multiplies that into tens of megabytes.
Now, the re-use of some of these n-grams is very consistent... for instance, "went to" would probably show up a lot in a training corpus, and that saves us some data, since we only have to store that n-gram once plus a small reference for each sentence it appears in. Instead of a 4-byte INT per word ID, we might also use a smaller type (a 2-byte SMALLINT easily covers a 10,000-word vocabulary) and compress the size of the training data considerably. With that, the total size of stored n-grams and their links for an 800,000-word training corpus might land somewhere between 10MB and 30MB. This is assuming that your corpus covers most of the most-used sentence patterns in said language.
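For a rough feel of where those numbers come from, here is a back-of-envelope calculation; the row layout (five fixed ID slots plus one supporting byte) and the "no re-use" starting point are just illustrative assumptions, not a measurement of any real system:

```python
# Back-of-envelope storage estimate for the corpus described above.
# All row sizes are assumptions for illustration only.

sentences = 100_000
avg_words = 8
max_n = 5

# Sliding-window count of n-grams in an average 8-word sentence: 8+7+6+5+4 = 30
ngrams_per_sentence = sum(avg_words - n + 1 for n in range(1, max_n + 1))

def raw_megabytes(id_bytes):
    row_bytes = max_n * id_bytes + 1   # five word-ID slots + 1 supporting byte
    return sentences * ngrams_per_sentence * row_bytes / 1_000_000

print(raw_megabytes(4))   # 4-byte INT IDs      -> ~63 MB before any re-use
print(raw_megabytes(2))   # 2-byte SMALLINT IDs -> ~33 MB before any re-use
```

Once the heavily repeated n-grams are de-duplicated and only sentence references are stored for them, it is plausible that the figure lands in the 10MB to 30MB range mentioned above.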
Over the years, many have done similar things. One good reference is the Stanford NLP library. If I recall, their CoreNLP system (not including many of their other tools for processing language) was somewhere around 150MB... and to actually generate language from what it gives you, you'd still have to use it in a way specific to generating language rather than just parsing it.
I'd estimate that, depending on your purposes, you could get away with storing English, along with a statistical set of reduced rules, within 100MB, and that is an optimistic estimate. In the end, it depends on what you're using it for, as well as how accurate you want it to be.