How can we prove a text is written by a human beign?

In summary, this thread discusses techniques for deciding whether a text or signal was generated randomly or created by a human. These techniques include probability calculations, Markov chains, and formal grammars for analyzing patterns and structure in the text or signal. Applying these methods makes it possible to estimate the likelihood that a text or signal was created by a human rather than generated randomly.
  • #1
doro
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper contains random letters generated by a computer. But one might say the second paper was typed by a human, because "HOW ARE YOU" is not made of random letters...

But how can we describe this in mathematics? How can we prove that the second paper was most probably typed by a human beign?
 
  • #2
If you are doing it at the level of letters forming words, you would simply run the text through a dictionary and see how much of it matches, assuming you know the language.
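This dictionary check can be sketched in a few lines of Python; the tiny word set here is a toy stand-in for a real dictionary, which you would load from a file in practice:

```python
# A minimal sketch of the dictionary-matching idea: count what fraction of
# tokens appear in a known word list. The tiny WORDS set stands in for a
# real dictionary.
WORDS = {"how", "are", "you", "the", "a", "is", "and", "to", "of"}

def dictionary_score(text, words=WORDS):
    """Return the fraction of whitespace-separated tokens found in `words`."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in words)
    return hits / len(tokens)

paper1 = "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"
paper2 = "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"
score1 = dictionary_score(paper1)  # no token matches the word list
score2 = dictionary_score(paper2)  # "how", "are", "you" match (3 of 7)
```

A higher score suggests the text is more likely human-typed, though the threshold is a judgment call.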

The technique for making computer-generated sentences that sort of make sense is called Markov chains; it's used to make spam messages that look human enough to get past spam filters.
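The Markov-chain trick can be sketched as follows; the training corpus and the fixed random seed are toy assumptions for illustration:

```python
# A minimal sketch of a word-level Markov chain: learn which word follows
# which in a training text, then generate new text by walking those
# transitions at random.
import random

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    words = text.lower().split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def generate(chain, start, length, rng=None):
    """Generate `length` words starting from `start`, following the chain."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat the cat ran on the mat"
chain = build_chain(corpus)
sample = generate(chain, "the", 6)
```

Every adjacent word pair in the output occurred somewhere in the training text, which is why such output looks locally plausible while being globally meaningless.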
 
  • #3
Here's one idea

You have two databases of texts, one containing texts written by humans (genuine), one containing texts written by computers (spam).

For every word appearing in the texts, you calculate the probability of it appearing in a message, conditional on the message being genuine or spam.

Then you use Bayes' formula to give the probability of the text being genuine or spam, conditional on the words that appear in it.

If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (digraphs) or triplets (trigraphs). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.
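A minimal sketch of this naive Bayes scheme, with Laplace (add-one) smoothing for words missing from one corpus; the two tiny corpora are toy stand-ins for real databases of genuine and spam text:

```python
# A minimal sketch of the naive Bayes idea described above, with Laplace
# smoothing so words absent from one corpus don't force zero probabilities.
import math

genuine = ["how are you today", "see you at the meeting"]
spam = ["buy cheap pills now", "cheap pills buy now"]

def word_counts(texts):
    counts = {}
    for t in texts:
        for w in t.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

g_counts, s_counts = word_counts(genuine), word_counts(spam)
vocab = set(g_counts) | set(s_counts)
g_total, s_total = sum(g_counts.values()), sum(s_counts.values())

def log_odds_genuine(text):
    """log P(genuine | words) - log P(spam | words), assuming uniform priors."""
    score = 0.0
    for w in text.split():
        p_g = (g_counts.get(w, 0) + 1) / (g_total + len(vocab))  # Laplace
        p_s = (s_counts.get(w, 0) + 1) / (s_total + len(vocab))
        score += math.log(p_g) - math.log(p_s)
    return score

odds1 = log_odds_genuine("how are you")      # positive: looks genuine
odds2 = log_odds_genuine("buy cheap pills")  # negative: looks like spam
```

A positive log-odds score favors "genuine", a negative one favors "spam"; real filters use far larger corpora and often digraph/trigraph features as described above.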
 
  • #4
NobodySpecial said:
The technique for making computer generated sentences that sort of make sense is called Markov chains, it's used to make spam messages that are human looking enough to get past spam filters.

That reminds me of www.electric-gibberish.info, just visit the subpages, not the main one.

Sounds like the OP is asking for a kind of Turing test.

Cexy said:
If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (digraphs) or triplets (trigraphs). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.

That's why they can be easily fooled by text generated with Markov chains.
 
  • #5
I'm not sure whether the point of my question was clear enough, so I'll put it in other words:

Suppose we received some signals from space: how can we prove whether the signals were just randomly generated or were sent by aliens?

Or consider this text, "DSABDT TDAB GLSC ABTGD"... Was this text randomly generated or was it created by a human? I would say someone wrote this text, because right after every A there is a B, and I think this pattern did not happen just by chance... But how can we describe this in mathematics?
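The "every A is followed by B" intuition can be made quantitative: under a uniform random model over 26 letters, each letter following an A is B with probability 1/26, so observing the pattern k times has probability (1/26)^k. A quick sketch:

```python
# A rough sketch of quantifying the "every A is followed by B" observation:
# under a uniform model over 26 letters, the letter after each A is B with
# probability 1/26, so k independent occurrences have probability (1/26)**k.
text = "DSABDT TDAB GLSC ABTGD"
letters = text.replace(" ", "")

a_positions = [i for i, c in enumerate(letters) if c == "A"]
followed_by_b = sum(1 for i in a_positions
                    if i + 1 < len(letters) and letters[i + 1] == "B")

k = len(a_positions)                    # 3 occurrences of A
p_random = (1 / 26) ** followed_by_b    # chance of the pattern under the uniform model
```

Here the pattern would arise by chance with probability (1/26)^3, about 6 in 100,000, which is why it feels deliberate.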


Thanks, everyone, for your replies...

Cexy said:
Here's one idea

You have two databases of texts, one containing texts written by humans (genuine), one containing texts written by computers (spam).

For every word appearing in the texts, you calculate the probability of it appearing in a message, conditional on the message being genuine or spam.

Then you use Bayes' formula to give the probability of the text being genuine or spam, conditional on the words that appear in it.

If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (digraphs) or triplets (trigraphs). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.

Thank you for your reply... I don't know much about Bayes' formula, but I will study it...
 
  • #6
doro said:
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper contains random letters generated by a computer. But one might say the second paper was typed by a human, because "HOW ARE YOU" is not made of random letters...

But how can we describe this in mathematics? How can we prove that the second paper was most probably typed by a human beign?

On top of what has been mentioned about the probabilistic approach (i.e. Markov modeling), you should look at a topic known as grammars. It is usually part of a compilers course in a computer science stream, but if you want to construct sentences, or in fact any kind of structured language, you will need to constrain the variety of structures to something based on a formally defined grammar.

If you are doing a computer science course, you might want to get a book on compilers and look at what token grammar representations allow you to do. For this, search for expositions on Backus-Naur Form grammars or their extended variety.

If you are a programmer, you can look at something like Yacc, which will satisfy your curiosity, as it parses such language definitions.

Combining grammars with statistics and building on this is probably a good start for the application you had in mind.
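As a toy illustration of the grammar idea, here is a hand-written recognizer for a tiny made-up BNF grammar; real tools such as Yacc generate this kind of parser from the grammar definition itself:

```python
# A toy recognizer for the (made-up) BNF grammar
#   <sentence> ::= <subject> <verb> <object>
#   <subject>  ::= "i" | "you" | "we"
#   <verb>     ::= "see" | "like"
#   <object>   ::= "cats" | "stars"
# A string is "grammatical" only if it matches this structure exactly.
SUBJECTS = {"i", "you", "we"}
VERBS = {"see", "like"}
OBJECTS = {"cats", "stars"}

def is_sentence(text):
    """Return True if `text` derives from <sentence> in the toy grammar."""
    words = text.lower().split()
    return (len(words) == 3
            and words[0] in SUBJECTS
            and words[1] in VERBS
            and words[2] in OBJECTS)

ok = is_sentence("we like stars")    # matches subject-verb-object
bad = is_sentence("stars like we")   # same words, wrong structure
```

Note that "stars like we" is rejected even though every word is in the vocabulary: grammars constrain structure, not just word choice, which is exactly what a word-frequency approach misses.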
 
  • #7
doro said:
I'm not sure whether the point of my question was clear enough, so I'll put it in other words:

Suppose we received some signals from space: how can we prove whether the signals were just randomly generated or were sent by aliens?

Or consider this text, "DSABDT TDAB GLSC ABTGD"... Was this text randomly generated or was it created by a human? I would say someone wrote this text, because right after every A there is a B, and I think this pattern did not happen just by chance... But how can we describe this in mathematics?


Thanks, everyone, for your replies...



Thank you for your reply... I don't know much about Bayes' formula, but I will study it...

A truly random word generator would produce "HOWAREYOU" just as often as "MEJWXLTOW". There is no sure way to weed out sentences produced by random versus non-random generators. You would have to rely on heuristic rules: for example, you may believe a truly random text would have roughly equal numbers of a's, b's, c's, and so on.
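The letter-frequency heuristic can be sketched with a chi-squared statistic against the uniform expectation; the two sample strings below are contrived for illustration:

```python
# A sketch of the letter-frequency heuristic: compare observed letter counts
# against the uniform expectation with a chi-squared statistic. A large value
# suggests the string is unlikely to come from a uniform random generator.
from collections import Counter

def chi_squared_uniform(text):
    """Chi-squared statistic of letter counts vs. a uniform 26-letter model."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    expected = n / 26
    counts = Counter(letters)
    return sum((counts.get(chr(65 + i), 0) - expected) ** 2 / expected
               for i in range(26))

uniformish = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" * 4   # perfectly even counts
skewed = "AAAAABBBBBAAAAABBBBB" * 5             # only A's and B's
stat_uniform = chi_squared_uniform(uniformish)
stat_skewed = chi_squared_uniform(skewed)
```

As the post notes, this is only a heuristic: a uniform generator can occasionally produce a skewed sample, so the statistic gives evidence, not proof.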
 
  • #8
doro said:
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper has some random letters which are generated by a computer randomly. But he might say the second paper is typed by a human because "HOW ARE YOU" is not made of random letters...

But how we can describe it in mathematics? How we can prove that the second paper most probably is typed by a human beign?

Nobody can prove it.
 
  • #9
The answer depends on the size of the pattern found, the paper, the punctuation (i.e. with or without caps, alphanumeric characters, symbols, etc.), and other factors...

ex.

What is the probability of any one combination out of the possible 20-word English sentences (with approx. 500,000 words in English),

500,000^20 = 95,367,431,640,625 followed by 100 zeros (about 9.5 × 10^113),

appearing on one page selected from infinitely random text?

Where did your gibberish text come from? How was it randomized?
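The count above is just 500,000^20 (one of roughly 500,000 English words in each of 20 positions), which a quick computation confirms:

```python
# A quick check of the combinatorics above: with roughly 500,000 English
# words and 20 word positions, the number of possible sentences is
# 500,000**20 = 5**20 * 10**100, a 114-digit number.
n = 500_000 ** 20
digits = len(str(n))
leading = str(n)[:14]  # 5**20 = 95,367,431,640,625 supplies the leading digits
```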
 
  • #10
Perhaps if there are misspellings, like "beign" for "being", we would be more likely to think it was human-generated.
 
  • #11
They can be introduced for a purpose.

As I wrote earlier, this is not much different from a Turing test. Please visit http://www.loebner.net/Prizef/loebner-prize.html and read about the competition as well as transcripts from previous editions.
 
  • #12
This is probably not what the author is asking, but one way to see whether a text is random is information theory. Entropy (developed by the great mathematician Shannon) gives us an answer as to whether a text contains much information. One would expect a random text not to contain much information, while a genuine text will contain much information. Thus entropy can be a way to discriminate between random texts and genuine texts.
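A minimal sketch of Shannon entropy over a string's character distribution; note that, measured per character, English-like text with its skewed letter frequencies typically scores lower than uniformly random letters:

```python
# A minimal sketch of Shannon entropy over a string's character
# distribution: estimate each symbol's frequency p and compute
# H = -sum(p * log2(p)), the average bits per character.
import math
from collections import Counter

def char_entropy(text):
    """Empirical Shannon entropy (bits/character) of a string."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h_uniform = char_entropy("ABCDEFGH")   # 8 equally likely symbols: 3 bits
h_skewed = char_entropy("AAAAAAAB")    # heavily skewed: well under 1 bit
```

Real discriminators estimate this from much longer samples, and often use per-digraph or per-word entropy rather than single characters.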
 
  • #13
micromass said:
It will be expected that a random text will not contain much information, while a genuine text will contain much information. Thus the entropy can be a way to discriminate between random texts and genuine text.

Shannon information is a measure of entropy. Entropy is a measure of the number of states in which a system can exist. How many ways can you arrange n characters (including spaces, only lower-case letters, and punctuation) in a string of length k; say about 30 characters in total and a message of k = 142? Each character has p = 1/30, so the number of possible states of the system is [tex]30^{142} = r[/tex] and the probability of any given state is p = 1/r. The Gibbs entropy (H) is:

[tex]H(X) = -\sum_{i=1}^{k} p_{i} \log_{2} p_{i}[/tex]

This is essentially the same equation as for Shannon Information. What Information Theory is about is the preservation of a string through a communication channel. The string itself can be anything. Meaning is imputed by the users. What's important is preservation of the particular string. It's obvious that the probability of a particular string is very low, corresponding to high entropy and high information content.

http://homepages.cwi.nl/~paulv/papers/info.pdf
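For the uniform case above, where each of the r = 30^142 strings has probability p = 1/r, the sum collapses to H = log2(r) = 142 · log2(30); a quick check:

```python
# For the uniform case above, each of r = 30**142 strings has probability
# p = 1/r, so the entropy collapses to H = log2(r) = 142 * log2(30) bits,
# i.e. about 4.9 bits per character.
import math

k, alphabet = 142, 30
H = k * math.log2(alphabet)  # entropy in bits of a uniform 142-character message
```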
 

1. How can we determine the authorship of a text?

There are several methods that can be used to determine the authorship of a text. One common approach is to analyze the writing style and language used in the text, such as word choice, sentence structure, and use of punctuation. Another method is to compare the text to known works of the suspected author and look for similarities in writing style. Handwriting analysis and linguistic analysis can also be used to determine authorship.

2. Can we use technology to prove a text is written by a human being?

Yes, technology can be used to analyze and compare texts to determine whether they were written by a human being. Natural Language Processing (NLP) techniques can analyze writing style and language patterns, and handwriting-analysis software can analyze handwritten texts. However, these methods are not foolproof and may not always provide definitive answers.

3. How can we distinguish between a human-written text and a machine-generated text?

One way to distinguish between a human-written text and a machine-generated text is to look for patterns and consistency in the writing. Machine-generated texts may lack variation and human-like errors, and may contain unnatural or repetitive language. Technology-based methods, such as NLP and handwriting analysis, can also help differentiate between human and machine writing.

4. Are there any limitations to using technology to prove authorship?

Yes, there are some limitations to using technology to prove authorship. For example, the accuracy of NLP and handwriting analysis tools can be affected by the quality of the text being analyzed. Handwriting analysis may also be limited if there are not enough known samples of the suspected author's handwriting for comparison. Additionally, technology-based methods may not be able to account for intentional changes in writing style or attempts to mimic another author's style.

5. Is it possible for a text to have multiple authors?

Yes, it is possible for a text to have multiple authors. In fact, many texts, such as collaborative writing projects or academic papers, may have multiple authors. In these cases, it may be more difficult to determine the specific contributions of each author, but technology-based methods can still be used to analyze the overall writing style and language patterns of the text.
