How can we prove a text is written by a human beign?

Thread starter: doro
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper contains random letters generated by a computer, but that the second paper was typed by a human, because "HOW ARE YOU" is not made of random letters...

But how can we describe this in mathematics? How can we prove that the second paper was most probably typed by a human beign?
 
If you are working at the level of letters forming words, you could simply run the text through a dictionary and see how many words match, assuming you know the language.
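A minimal sketch of this dictionary-matching idea; the tiny word set here is a stand-in for a real word list (e.g. a system dictionary file):

```python
# Score a text by the fraction of its words found in a dictionary.
# This tiny set is a placeholder for a real word list.
ENGLISH_WORDS = {"how", "are", "you", "hello", "world"}

def dictionary_score(text):
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in ENGLISH_WORDS)
    return hits / len(words)

print(dictionary_score("HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"))  # 3 of 7 words match
print(dictionary_score("ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"))      # 0 of 5 match
```

A higher score suggests human-written text in the known language; where to put the cutoff is a judgment call.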

The technique for making computer-generated sentences that sort of make sense is called a Markov chain; it's used to make spam messages that look human enough to get past spam filters.
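A rough sketch of such a word-level Markov chain generator; the training corpus here is made up purely for illustration:

```python
import random
from collections import defaultdict

def build_chain(words):
    # Map each word to the list of words that follow it in the training text.
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length, seed=0):
    # Walk the chain, picking a random successor at each step.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug".split()
chain = build_chain(corpus)
print(generate(chain, "the", 8))
```

Because each word is a plausible successor of the previous one, the output locally resembles the training text even when the sentence as a whole is nonsense.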
 
Here's one idea

You have two databases of texts, one containing texts written by humans (genuine), one containing texts written by computers (spam).

For every word appearing in the texts, you calculate the probability of it appearing in a message, conditional on the message being genuine or spam.

Then you use Bayes' formula to give the probability of the text being genuine or spam, conditional on the words that appear in it.

If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (bigrams) or triplets (trigrams). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.
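A toy sketch of this Bayes approach with Laplace-style smoothing; the two tiny "databases" and the equal priors are assumptions made up for illustration:

```python
import math
from collections import Counter

# Miniature stand-ins for the genuine and spam text databases.
genuine = ["how are you", "see you tomorrow", "thanks for the help"]
spam = ["xjq zpt vnm", "qqq wsd xjq", "zpt zpt qqq"]

def word_counts(texts):
    return Counter(w for t in texts for w in t.split())

def log_likelihood(text, counts, total, vocab_size):
    # Laplace smoothing: every word gets a pseudo-count of 1,
    # so unseen words do not zero out the whole product.
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in text.split())

g_counts, s_counts = word_counts(genuine), word_counts(spam)
vocab = set(g_counts) | set(s_counts)
g_total, s_total = sum(g_counts.values()), sum(s_counts.values())

def classify(text):
    # Equal priors assumed; compare log-likelihoods under the two word models.
    lg = log_likelihood(text, g_counts, g_total, len(vocab))
    ls = log_likelihood(text, s_counts, s_total, len(vocab))
    return "genuine" if lg > ls else "spam"

print(classify("how are you"))   # genuine
print(classify("xjq qqq zpt"))   # spam
```

With equal priors, comparing likelihoods is equivalent to comparing the posterior probabilities from Bayes' formula.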
 
NobodySpecial said:
The technique for making computer-generated sentences that sort of make sense is called a Markov chain; it's used to make spam messages that look human enough to get past spam filters.

That reminds me of www.electric-gibberish.info; just visit the subpages, not the main one.

Sounds like the OP is asking for a kind of Turing test.

Cexy said:
If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (bigrams) or triplets (trigrams). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.

That's why they can be easily fooled by text generated with Markov chains.
 
I'm not sure whether the point of my question was clear enough, so I will put it in other words:

Suppose we received some signals from space. How can we prove whether the signals were randomly generated or sent by aliens?

Or consider this text, "DSABDT TDAB GLSC ABTGD"... Is this text randomly generated, or was it created by a human? I would say someone wrote it, because right after every A there is a B, and I don't think this pattern happens just by chance... But how can we describe this in mathematics?
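This intuition can be quantified, assuming (for illustration) a null model in which each letter is drawn uniformly and independently from 26 letters:

```python
# How surprising is "every A is immediately followed by a B"
# under a model of uniformly random letters?
text = "DSABDT TDAB GLSC ABTGD"
letters = text.replace(" ", "")

# Look at every adjacent pair and check what follows each A.
pairs = list(zip(letters, letters[1:]))
n_a = sum(1 for a, b in pairs if a == "A")
n_ab = sum(1 for a, b in pairs if a == "A" and b == "B")
print(n_a, n_ab)  # prints 3 3: three A's, each followed by a B

# Under the uniform model, P(next letter is B) = 1/26 for each A,
# so seeing it happen every time has probability (1/26)**n_a.
p = (1 / 26) ** n_a
print(p)  # 1/17576, about 5.7e-5
```

A small probability under the random model is evidence (though not proof) that the text was generated by some non-random process.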


Thanks everyone for your replies...

Cexy said:
Here's one idea

You have two databases of texts, one containing texts written by humans (genuine), one containing texts written by computers (spam).

For every word appearing in the texts, you calculate the probability of it appearing in a message, conditional on the message being genuine or spam.

Then you use Bayes' formula to give the probability of the text being genuine or spam, conditional on the words that appear in it.

If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (bigrams) or triplets (trigrams). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.

Thank you for your reply... I don't know much about Bayes' formula, but I will study it...
 
doro said:
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper contains random letters generated by a computer, but that the second paper was typed by a human, because "HOW ARE YOU" is not made of random letters...

But how can we describe this in mathematics? How can we prove that the second paper was most probably typed by a human beign?

On top of what has been mentioned about the probabilistic approach (i.e. Markov modelling), you should look at a topic known as grammars. It's usually part of a compilers course in a computer science stream, but if you want to construct sentences, or in fact any kind of structured language, you will need to constrain the variety of structures to something based on a formally defined grammar.

If you are doing a computer science course, you might want to get a book on compilers and look at what token grammar representations allow you to do. For this, search for expositions on Backus-Naur Form grammars or their extended variety.

If you are a programmer, you can look at something like Yacc, which will satisfy your curiosity, as it parses such language definitions.

Combining grammars with statistics and building on this is probably a good start for the application you had in mind.
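To make the grammar idea concrete, here is a toy grammar (invented for illustration) and a trivial checker for it; real tools like Yacc work from much richer BNF definitions:

```python
# Toy grammar in the spirit of BNF:
#   sentence ::= noun verb
#   noun     ::= "cats" | "dogs"
#   verb     ::= "run" | "sleep"

NOUNS = {"cats", "dogs"}
VERBS = {"run", "sleep"}

def is_sentence(text):
    # A string belongs to the language only if it matches the sentence rule.
    tokens = text.split()
    return len(tokens) == 2 and tokens[0] in NOUNS and tokens[1] in VERBS

print(is_sentence("cats run"))   # True
print(is_sentence("run cats"))   # False: right words, wrong structure
```

A text can then be scored not just on vocabulary but on whether its structure is derivable from the grammar at all.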
 
doro said:
I'm not sure whether the point of my question was clear enough, so I will put it in other words:

Suppose we received some signals from space. How can we prove whether the signals were randomly generated or sent by aliens?

Or consider this text, "DSABDT TDAB GLSC ABTGD"... Is this text randomly generated, or was it created by a human? I would say someone wrote it, because right after every A there is a B, and I don't think this pattern happens just by chance... But how can we describe this in mathematics?

Thanks everyone for your replies...

Thank you for your reply... I don't know much about Bayes' formula, but I will study it...

A truly random word generator would produce "HOWAREYOU" exactly as often as "MEJWXLTOW". There is no sure way to weed out sentences produced by random versus non-random generators. You would have to rely on heuristic rules; for example, you might expect a truly random text to contain roughly equal numbers of a's, b's, c's, etc.
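That heuristic can be sketched as a chi-square-style comparison of letter counts against the uniform model; larger values mean more skewed letter use, and where to set the threshold is a judgment call:

```python
from collections import Counter

def frequency_skew(text):
    # Compare observed letter counts against a uniform model over 26 letters.
    letters = [c for c in text.upper() if c.isalpha()]
    if not letters:
        return 0.0
    counts = Counter(letters)
    expected = len(letters) / 26
    # Chi-square-style statistic: sum of squared deviations from uniform.
    return sum((counts[chr(ord("A") + i)] - expected) ** 2 / expected
               for i in range(26))

print(frequency_skew("HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"))
print(frequency_skew("ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"))
```

On texts this short the statistic is noisy; with enough text, genuine English shows a much larger skew than a uniform random stream.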
 
doro said:
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper contains random letters generated by a computer, but that the second paper was typed by a human, because "HOW ARE YOU" is not made of random letters...

But how can we describe this in mathematics? How can we prove that the second paper was most probably typed by a human beign?

Nobody can prove it.
 
The answer depends on the size of the pattern found, the paper, and the punctuation (i.e. with or without capitals, alphanumerics, symbols, etc.), among other factors.

For example: what is the probability of any one particular combination, out of all possible 20-word English sentences (there are approximately 500,000 words in English, giving 500,000^20, roughly 9.54 × 10^113, possible sentences), appearing on one page selected from infinitely random text?

Where did your gibberish text come from? How was it randomized?
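For what it's worth, the count above can be checked directly (assuming the 500,000-word figure):

```python
# Number of 20-word sequences drawn from a 500,000-word vocabulary.
count = 500_000 ** 20
print(len(str(count)))                          # 114 digits, i.e. about 10^113
print(str(count).startswith("95367431640625"))  # True: 5**20 followed by 100 zeros
```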
 
Perhaps if there are misspellings like "beign" for "being", we would be more likely to think it was human-generated.
 
They can be introduced deliberately.

As I wrote earlier, this is not much different from the Turing test. Please visit http://www.loebner.net/Prizef/loebner-prize.html and read about the competition, as well as transcripts from the previous editions.
 
This is probably not what the author is asking, but one way to see whether a text is random or not is information theory. Entropy (developed by the great mathematician Shannon) gives us an answer as to whether a text contains much information or not. One would expect a random text not to contain much information, while a genuine text will contain much information. Thus entropy can be a way to discriminate between random texts and genuine texts.
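An empirical character-level entropy estimate can be sketched as follows; note the caveat, which the next reply touches on, that a uniformly random letter stream actually approaches the maximum log2(26) ≈ 4.7 bits per character, so this statistic needs careful interpretation on its own:

```python
import math
from collections import Counter

def entropy(text):
    # Empirical character-level Shannon entropy, in bits per character.
    n = len(text)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(text).values())

print(entropy("AAAAAAAAAA"))  # 0.0: a constant string carries no surprise
print(entropy("HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"))
```

In practice one compares the measured entropy (of characters, or of word or n-gram distributions) against reference values for the language rather than reading it in isolation.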
 
micromass said:
One would expect a random text not to contain much information, while a genuine text will contain much information. Thus entropy can be a way to discriminate between random texts and genuine texts.

Shannon information is a measure of entropy. Entropy is a measure of the number of states in which a system can exist. How many ways can you arrange n characters (spaces, lower-case letters and punctuation only) in a string of length k, say n = 30 characters in total and a message of length k = 142? Each character has p = 1/30, so the number of possible states of the system is 30^{142} = r, and the probability of any given state is p = 1/r. The Gibbs entropy H is:

H(X) = -\sum_{i=1}^{k} p_i \log_2 p_i

This is essentially the same equation as for Shannon information. What information theory is about is the preservation of a string through a communication channel. The string itself can be anything; meaning is imputed by the users. What's important is preservation of the particular string. It's obvious that the probability of a particular string is very low, corresponding to high entropy and high information content.

http://homepages.cwi.nl/~paulv/papers/info.pdf
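The numbers in the post above can be checked directly (30 equiprobable characters, message length 142):

```python
import math

# Under the uniform model, a length-142 message over 30 characters has
# 30**142 equally likely states; its entropy is 142 * log2(30) bits.
states = 30 ** 142
bits = 142 * math.log2(30)
print(round(bits, 1))                         # about 696.8 bits
print(math.isclose(bits, math.log2(states)))  # True
```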
 