A set of all possible tweets

  1. Apr 17, 2012 #1



    Hello PF,

    I have a question.
    I am an artist working with big numbers and language. I created a twitter account that will post all possible tweets. (Edit: link removed)

    I am interested in calculating the probability of specific tweets. For example: what is the probability that my next tweet will be "End here. Us then. Finn, again! Take. Bussoftlhee, mememormee!", or just gibberish like "j^F9c@# 64l["? Is this doable? I want to use this to reduce other tweets on Twitter to probabilities; I like the idea of converting meaningful language into a number. Also, how do I calculate the probability of a specific sequence of tweets?

    And ok... I realize that these numbers are really really big, that the probability is basically zero, but I am not interested in reality.

    Last edited by a moderator: Apr 18, 2012
  3. Apr 19, 2012 #2
    If you're asking about the number of possible arrangements of characters in a 140 character tweet, the answer is fairly straightforward. Given an alphabet of k characters (including spaces), the number of possible arrangements is [itex] k^{140} [/itex]. Of course, most of these arrangements will be nonsense. The probability of any specific sequence is just [itex] 1/k^{140}[/itex] assuming every character has an equal probability of occurring. Obviously the problem is more complicated if this assumption doesn't hold.
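    A sketch of this arithmetic in Python, whose integers have arbitrary precision, so [itex]k^{n}[/itex] is computed exactly (the 70-symbol alphabet below is just a guess, since the real Twitter character set isn't specified here):

```python
from fractions import Fraction

def num_fixed_length_tweets(k, n=140):
    """Number of distinct strings of exactly n characters over a k-symbol alphabet."""
    return k ** n

def prob_specific_tweet(k, n=140):
    """Probability of one specific n-character string, assuming every
    character is equally likely and drawn independently."""
    return Fraction(1, k ** n)

# With a hypothetical 70-symbol alphabet and short 3-character "tweets":
print(num_fixed_length_tweets(70, 3))  # 343000
print(prob_specific_tweet(70, 3))      # 1/343000
```

    Using Fraction keeps the probability exact, instead of underflowing to a float 0.0 for a realistic n of 140.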
    Last edited: Apr 19, 2012
  4. Apr 19, 2012 #3
    Even if we assume that each character occurs with the same probability, independently of the other characters, we still need to consider tweets with fewer than 140 characters. E.g. there are [itex]k^{100}[/itex] 100-character tweets.
    Therefore, the total number of possible non-empty tweets is [itex]\sum^{140}_{i=1} k^i[/itex], and the probability of observing any given tweet is then [itex]1/\sum^{140}_{i=1} k^i[/itex].
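    A quick sketch of this count in Python; the second function uses the geometric-series identity [itex]\sum^{n}_{i=1} k^i = (k^{n+1}-k)/(k-1)[/itex] to avoid the explicit loop:

```python
def num_all_tweets(k, n=140):
    """Total number of non-empty strings of length 1..n over a k-symbol alphabet."""
    return sum(k ** i for i in range(1, n + 1))

def num_all_tweets_closed(k, n=140):
    """Same count via the geometric-series identity (k**(n+1) - k) // (k - 1)."""
    return (k ** (n + 1) - k) // (k - 1)

# Tiny check with k = 3 symbols and tweets of up to 5 characters:
print(num_all_tweets(3, 5))  # 3 + 9 + 27 + 81 + 243 = 363
assert num_all_tweets(3, 5) == num_all_tweets_closed(3, 5)
```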
  5. Apr 19, 2012 #4
    Don't forget that URLs of the form http://anything are automatically shortened by Twitter. See http://support.twitter.com/entries/109623.

    This is now a much more interesting problem, since you have to calculate the probability that a random string contains a syntactically correct url.
    Last edited by a moderator: May 5, 2017
  6. Apr 19, 2012 #5
    I believe it can be shown that, by including a space as a character, my formulation is equivalent to yours. For a string of 140 characters, the probability that the last character is a space is [itex]1/k[/itex]; for the last two characters being spaces, [itex]P=1/k^2[/itex]; …; for the "last" 140 characters, [itex]P=1/k^{140}[/itex]. This at least was my intent in including an empty space character.
  7. Apr 19, 2012 #6
    My bad. You are totally right. Using my approach the tweet "good_morning" is different from the tweet "good_morning_________", where "_" stands for space. Clearly, this does not make sense in this setting.
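    The discrepancy between the two formulations is easy to check numerically for a tiny alphabet (a sketch; the choice of k = 2 symbols and length 3 is arbitrary):

```python
def padded_count(k, n):
    """Strings of exactly n characters over the k symbols plus a space:
    what the space-padding formulation counts."""
    return (k + 1) ** n

def variable_length_count(k, n):
    """Non-empty strings of length 1..n over the k symbols alone."""
    return sum(k ** i for i in range(1, n + 1))

print(padded_count(2, 3))           # 27
print(variable_length_count(2, 3))  # 2 + 4 + 8 = 14
# The padded count is larger because it also includes strings with leading
# or interior spaces, which don't correspond to any shorter tweet.
```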
  8. Apr 19, 2012 #7
    No problem. Unfortunately, real language processing is orders of magnitude more complicated and involves inputting huge amounts of data about allowable strings and syntax. English is more difficult than continental European languages because of its "quaint" spelling. Add to that the highly abbreviated, non-standard language used in tweets, and you have a real challenge. (Although some kind of standardized English-based "Tweetish" would probably be easier to process than the standard dialect.)
    Last edited: Apr 19, 2012
  9. Apr 19, 2012 #8



    Take a look at grammars and at constructing realizations of a particular grammar. You can then supply a dictionary and a set of conditional probability distributions, and generate realizations using a random number generator driven by those distributions and the grammar.

    The grammars will be Markovian in nature, and the specificity of this will depend on how you arrange the tags, how they are linked together, and their internal structure versus the global structure.
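    A minimal sketch of this idea in Python, using a toy probabilistic grammar (every rule, word, and probability below is invented purely for illustration):

```python
import random

# A toy probabilistic grammar: each nonterminal maps to a list of
# (probability, expansion) pairs; lowercase strings are terminal words.
GRAMMAR = {
    "S":  [(1.0, ["NP", "VP"])],
    "NP": [(0.5, ["the", "N"]), (0.5, ["a", "N"])],
    "VP": [(0.6, ["V", "NP"]), (0.4, ["V"])],
    "N":  [(0.5, ["cat"]), (0.5, ["tweet"])],
    "V":  [(0.5, ["likes"]), (0.5, ["posts"])],
}

def generate(symbol="S"):
    """Expand a symbol into a list of terminal words, choosing each
    production according to its probability."""
    if symbol not in GRAMMAR:  # terminal word
        return [symbol]
    probs, expansions = zip(*GRAMMAR[symbol])
    expansion = random.choices(expansions, weights=probs)[0]
    words = []
    for sym in expansion:
        words.extend(generate(sym))
    return words

print(" ".join(generate()))  # e.g. "the cat posts a tweet"
```

    Every output is grammatical by construction, which (as noted below) is not the same as being sensible.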
  10. Apr 19, 2012 #9
    If I understand you, generating grammatically correct sentences isn't equivalent to generating sensible sentences. "He baked a ward with Venus and bled incoherently."
  11. Apr 19, 2012 #10



    You can add as much constraint as you want with the grammar: it doesn't have to correspond to a normal spoken or written version we use: you can add features of txtspeak and any other kind of realization you want to include.

    Don't confuse grammar with English grammar: it's a general grammatical structure.
  12. Apr 20, 2012 #11
    My point was that if we want to consider only tweets that make sense, as opposed to random sequences of characters, we have to consider both grammar and semantics. There are programs that can do this to a limited degree, but as far as I know, there's no feasible way to assuredly obtain every sensible statement within the 140-character limit, regardless of which non-ideographic written general-purpose language you choose. Regarding ideographic scripts (e.g. Mandarin), I have no idea.
    Last edited: Apr 20, 2012
  13. Apr 20, 2012 #12



    In terms of the actual grammatical structure (the tags, their relationships to other tags, and the overall structure), semantics just adds to the structure.

    The grammar can be as detailed and as complex as you want, where greater complexity allows possibilities you could not otherwise have and gives you more control than you would have at lower complexity; but again, it's all in the grammar definition.

    As an example of what I mean at its most extreme, you could have all possible sentences in a tag each and then create an output tag that is basically an XOR statement of all the possible leaf tag definitions. Although you wouldn't do this, the point is that the grammatical structure can generate whatever you want it to generate.

    Of course you wouldn't do this: you would get a linguist to specify the semantic and syntactic issues and generate a final grammar that is optimal in terms of description. In other words, you want a grammar of minimum complexity that retains all the semantic and syntactic information for the valid realizations: you are solving a kind of optimization problem, with the constraints determined by the syntax, semantics, and other relevant information that a linguistic specialist would supply.

    This is a language-independent phenomenon: you could apply it to Mandarin, just as you can apply it to representing the data structure of a bitmap or to specifying English text. As long as the alphabet is quantized (and you could extend the idea to a non-quantized alphabet), the idea doesn't change.
  14. Apr 20, 2012 #13



    I'm not sure what you're trying to do, but that sounds like a mistake. The numbers are absolutely ridiculous.

    I don't know what characters are allowed in a tweet, so I'll guess that there are 70 of them (26 lowercase letters, 26 uppercase letters, 10 digits, and a few non-alphabetic symbols). So there are 70^140 ≈ 2.05932837 × 10^258 possible tweets. For comparison, the current age of the universe (≈13.7 billion years) is less than 10^18 seconds.

    If you could generate a billion tweets per second for 13.7 billion years, then you will have generated about 4.32 × 10^26 messages. That's a lot, right? But the number of tweets you still haven't generated is approximately

    2.05932837*10^258 - 4.32*10^26 = 2.05932837*10^258.

    So the number of tweets you still haven't generated is essentially unchanged...after a billion tweets per second for 13.7 billion years.

    How is this possible? 2.05932837*10^258 is a 259-digit number that starts with 2059328370000000000 (240 more zeroes after that). The computer has obviously rounded off to 9 significant figures. When we subtract the 27-digit number 4.32*10^26 from that, we get a 259-digit number that starts with 20593283699999999999 and then has nothing but nines until the last 28 digits. So when the computer displays the result of the subtraction, it rounds off 2.0593283699...(220 more nines, followed by 28 more digits) to 2.05932837. The error introduced by this roundoff is completely insignificant compared to the error that was introduced by keeping only 9 significant figures in the original calculation of 70^140. We would have had to keep at least 232 significant figures just to see that the number of remaining tweets is smaller after 13.7 billion years.
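    The roundoff issue goes away with exact integer arithmetic; a sketch in Python, whose ints have arbitrary precision (the 70-symbol alphabet is the same guess as above):

```python
total = 70 ** 140                          # all 140-character tweets, computed exactly
seconds = 137 * 10 ** 8 * 365 * 24 * 3600  # ~13.7 billion years, ignoring leap years
generated = 10 ** 9 * seconds              # a billion tweets/second: ~4.32e26 tweets

remaining = total - generated
assert remaining < total                     # the count really did go down...
assert str(total)[:9] == str(remaining)[:9]  # ...but the leading digits are untouched
print(len(str(total)))                       # 259 digits
```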

    I haven't tried to calculate this, but I think the probability that any of the tweets generated in those 13.7 billion years will make sense is extremely small (if they are generated randomly).
    Last edited: Apr 20, 2012
  15. Apr 20, 2012 #14
    Yes, but if you have an infinite number of teenage girls tweeting, one of the tweets will eventually make sense as [itex]t \rightarrow \infty[/itex].
  16. Apr 24, 2012 #15



    No, that's known to be p=0; selection is not random and excludes sensible tweets. (apologies to teenage girls).
  17. Apr 24, 2012 #16
    No need to apologize: since there are infinitely many, p = 0 only means "almost impossible", as each singleton has probability 0.
  18. Apr 25, 2012 #17
    The technical bits have been explained already, but I would like to point out the glaring hole in your idea: the assumption of independence.

    For example, tweets on Twitter are not independent at all, so calculating the probabilities gets vastly more complicated, as you have to factor in external events.

    Basically, without investing millions of dollars into research, I don't see how what you want to do can be done.