How can we prove a text is written by a human beign?

  • Context: Undergrad
  • Thread starter: doro
  • Tags: Human Text

Discussion Overview

The discussion revolves around the challenge of determining whether a given text is generated by a human or a computer. Participants explore various mathematical and statistical approaches to differentiate between human-written and computer-generated texts, including the use of probability, Markov chains, and grammatical structures.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants suggest using a dictionary to check the match of words in a text to determine if it is human-generated.
  • One idea involves creating two databases of texts—one for human-written and one for computer-generated texts—and using Bayes' formula to calculate the probability of a text being genuine or spam based on word occurrences.
  • Another participant mentions the use of Markov chains in generating human-like text and highlights its application in spam filters.
  • There is a proposal to incorporate grammatical structures into the analysis, referencing formal grammar definitions like Backus-Naur Form.
  • Some participants express uncertainty about the ability to definitively prove whether a text is human-generated or not, citing the complexity of randomness and patterns in text generation.
  • One participant notes that misspellings might indicate human generation, while another emphasizes that a truly random generator would produce a variety of outputs without bias.
  • Concerns are raised about the limitations of heuristics and the need for larger databases to improve the accuracy of the models discussed.
  • There is acknowledgment that the answer depends on various factors, including the size of patterns found and the nature of the text itself.

Areas of Agreement / Disagreement

Participants do not reach a consensus on a definitive method to prove whether a text is human-generated or computer-generated. Multiple competing views and approaches remain, with ongoing debate about the effectiveness of different techniques.

Contextual Notes

Limitations include the dependence on the definitions of randomness and human-like text, the need for extensive databases, and unresolved mathematical complexities in the proposed methods.

doro
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper has some random letters which are generated by a computer randomly. But he might say the second paper is typed by a human because "HOW ARE YOU" is not made of random letters...

But how we can describe it in mathematics? How we can prove that the second paper most probably is typed by a human beign?
 
If you are doing it at the level of letters into words you would simply run it through a dictionary and see how much matches - assuming you know the language.
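
A minimal sketch of that dictionary check, in Python. The tiny `DICTIONARY` set and the function name `dictionary_match_ratio` are made up for illustration; a real check would load a full word list for the language in question.

```python
import re

# Hypothetical mini-dictionary; a real check would load a full word list.
DICTIONARY = {"HOW", "ARE", "YOU", "I", "AM", "FINE"}

def dictionary_match_ratio(text, dictionary=DICTIONARY):
    """Return the fraction of alphabetic tokens that appear in the dictionary."""
    tokens = re.findall(r"[A-Za-z]+", text.upper())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)

paper_1 = "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"
paper_2 = "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"
print(dictionary_match_ratio(paper_1))  # 0 of 5 tokens match
print(dictionary_match_ratio(paper_2))  # 3 of 7 tokens match
```

The ratio is only a score, not a proof: where to put the threshold between "random" and "human" is a separate judgment call.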

The technique for making computer generated sentences that sort of make sense is called Markov chains, it's used to make spam messages that are human looking enough to get past spam filters.
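
For the curious, here is a rough sketch of a first-order, word-level Markov chain generator. The training sentence and the function names `build_chain` and `generate` are invented for this example; real generators train on far larger texts and often use longer contexts.

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each word to the list of words that follow it in the training text."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`, picking each next word at random."""
    output = [start]
    for _ in range(length - 1):
        followers = chain.get(output[-1])
        if not followers:  # dead end: no word ever followed this one
            break
        output.append(rng.choice(followers))
    return " ".join(output)

corpus = "how are you today how are things how do you do".split()
chain = build_chain(corpus)
print(generate(chain, "how", 8, random.Random(0)))
```

Every adjacent pair of words in the output also occurs somewhere in the training text, which is why the result looks locally plausible even though the whole sentence may mean nothing.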
 
Here's one idea

You have two databases of texts, one containing texts written by humans (genuine), one containing texts written by computers (spam).

For every word appearing in the texts, you calculate the probability of it appearing in a message, conditional on the message being genuine or spam.

Then you use Bayes' formula to give the probability of the text being genuine or spam, conditional on the words that appear in it.

If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (digraphs) or triplets (trigraphs). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.
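
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular spam filter works: the two toy databases are invented, words are treated as independent given the class (the usual "naive Bayes" simplification), and add-one smoothing stands in for Laplace's rule of succession.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts

def log_likelihood(words, counts, vocab_size):
    """Sum of log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)

def prob_genuine(text, genuine, spam, prior_genuine=0.5):
    """Posterior P(genuine | words) from Bayes' formula, assuming word independence."""
    words = text.lower().split()
    vocab_size = len(set(genuine) | set(spam) | set(words))
    log_g = math.log(prior_genuine) + log_likelihood(words, genuine, vocab_size)
    log_s = math.log(1 - prior_genuine) + log_likelihood(words, spam, vocab_size)
    # Normalise in log space to avoid underflow on long texts.
    peak = max(log_g, log_s)
    g, s = math.exp(log_g - peak), math.exp(log_s - peak)
    return g / (g + s)

# Toy databases, invented for this example.
genuine = train(["how are you", "see you at lunch", "how was the lecture"])
spam = train(["hsdf jgyj lkjs", "asddsf dfgdf dfd", "qwe rty uio"])
print(prob_genuine("how are you", genuine, spam))
print(prob_genuine("dfd qwe lkjs", genuine, spam))
```

Thanks to the smoothing, a word that appears in neither database gets a small non-zero probability in both classes instead of zeroing out the whole product.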
 
NobodySpecial said:
The technique for making computer generated sentences that sort of make sense is called Markov chains, it's used to make spam messages that are human looking enough to get past spam filters.

That reminds me about www.electric-gibberish.info, just visit subpages, not the main one.

Sounds like OP is asking for a kind of a Turing test.

Cexy said:
If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (digraphs) or triplets (trigraphs). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.

That's why they can be easily fooled by text generated with Markov chains.
 
I'm not sure whether the point of my question was clear enough, so I'll put it in other words:

Suppose we received some signals from space. How can we prove whether the signals are just randomly generated or were sent by aliens?

Or consider this text, "DSABDT TDAB GLSC ABTGD"... Is this text randomly generated or was it created by humans? I would say someone wrote this text, because right after every A there is a B, and I think this pattern did not happen just by chance. But how can we describe this in mathematics?


Thanks, everyone, for your replies...
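
One way to put a number on the "B after every A" observation is sketched below. The null model is an assumption made for this example: each character following an A is drawn uniformly and independently from 27 symbols (26 letters plus the space). The helper name `followers_of` is likewise made up.

```python
def followers_of(text, letter):
    """Return the character that comes right after each occurrence of `letter`."""
    return [text[i + 1] for i in range(len(text) - 1) if text[i] == letter]

text = "DSABDT TDAB GLSC ABTGD"
after_a = followers_of(text, "A")
print(after_a)  # every entry is "B" for this text

# Assumed null model: each follower is drawn uniformly from 27 symbols
# (26 letters plus the space), independently of everything else.
alphabet_size = 27
p_chance = (1 / alphabet_size) ** len(after_a)
print(p_chance)
```

A caveat: the pattern was picked after looking at the text, and many other patterns would have looked just as striking, so the real evidence is weaker than this single probability suggests.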

Cexy said:
Here's one idea

You have two databases of texts, one containing texts written by humans (genuine), one containing texts written by computers (spam).

For every word appearing in the texts, you calculate the probability of it appearing in a message, conditional on the message being genuine or spam.

Then you use Bayes' formula to give the probability of the text being genuine or spam, conditional on the words that appear in it.

If you want to make this more advanced, you could also compute conditional probabilities of pairs of words (digraphs) or triplets (trigraphs). You need a correspondingly larger database to do this, however. You also need to get around the problem that many genuine words will not appear in either database (Laplace's rule of succession can help here).

This is the basis of many email spam filters.

Thank you for your reply... I don't know much about Bayes' formula, but I will study it...
 
doro said:
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper has some random letters which are generated by a computer randomly. But he might say the second paper is typed by a human because "HOW ARE YOU" is not made of random letters...

But how we can describe it in mathematics? How we can prove that the second paper most probably is typed by a human beign?

On top of what has been mentioned about the probabilistic approach (i.e. Markov modeling), you should look at a topic known as grammars. It's usually part of a compilers course in a computer science stream, but if you want to construct sentences, or in fact any kind of structured language, you will need to constrain the variety of structures to something based on a formally defined grammar.

If you are doing a computer science course, you might want to get a book on compilers and look at what token grammar representations allow you to do. For this, search for expositions on Backus-Naur Form grammars or their extended variety.

If you are a programmer, you can look at something like Yacc, which will satisfy your curiosity, as it parses such language definitions.

Combining grammars with statistics and building on this is probably a good start for the application you had in mind.
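
To give a flavour of the grammar idea, here is a small sketch: a toy grammar written in a BNF-like form and a brute-force recognizer for it. The grammar, the `GRAMMAR` table and the function names are all invented for illustration; this is not how Yacc works internally, and a real grammar for English would be vastly larger.

```python
# Toy grammar in a BNF-like form (hypothetical, for illustration only):
#   <sentence>  ::= <question> | <statement>
#   <question>  ::= "how" <verb> <pronoun>
#   <statement> ::= <pronoun> <verb> "fine"
#   <verb>      ::= "are" | "am"
#   <pronoun>   ::= "you" | "i"
GRAMMAR = {
    "sentence": [["question"], ["statement"]],
    "question": [["how", "verb", "pronoun"]],
    "statement": [["pronoun", "verb", "fine"]],
    "verb": [["are"], ["am"]],
    "pronoun": [["you"], ["i"]],
}

def match(symbol, tokens, pos):
    """Yield every position reachable after deriving `symbol` from tokens[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must equal the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        positions = [pos]
        for part in production:
            positions = [end for start in positions
                         for end in match(part, tokens, start)]
        yield from positions

def is_sentence(text):
    """True if the whole text can be derived from <sentence>."""
    tokens = text.lower().split()
    return len(tokens) in match("sentence", tokens, 0)

print(is_sentence("HOW ARE YOU"))
print(is_sentence("HSHJFD SDF JGYJHNF"))
```

A grammar check like this gives a yes/no answer per sentence; combining it with word statistics, as suggested above, would turn that into a score.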
 
doro said:
I'm not sure whether the point of my question was clear enough, so I'll put it in other words:

Suppose we received some signals from space. How can we prove whether the signals are just randomly generated or were sent by aliens?

Or consider this text, "DSABDT TDAB GLSC ABTGD"... Is this text randomly generated or was it created by humans? I would say someone wrote this text, because right after every A there is a B, and I think this pattern did not happen just by chance. But how can we describe this in mathematics?


Thanks, everyone, for your replies...

Thank you for your reply... I don't know much about Bayes' formula, but I will study it...

A truly random word generator would produce "HOWAREYOU" as frequently as "MEJWXLTOW". There is no sure way to tell apart sentences produced by random and non-random generators. You would have to rely on heuristic rules; for example, you may expect a truly random text to have roughly equal numbers of a's, b's, c's, etc.
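
The equal-frequency heuristic can be turned into a number with a Pearson chi-square statistic, sketched below. The function name `chi_square_uniform` and the choice of a uniform 26-letter model are assumptions for this example.

```python
from collections import Counter

def chi_square_uniform(text, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Pearson chi-square statistic of letter counts against a uniform model."""
    letters = [c for c in text.upper() if c in alphabet]
    if not letters:
        raise ValueError("text contains no letters from the alphabet")
    expected = len(letters) / len(alphabet)
    counts = Counter(letters)
    return sum((counts[c] - expected) ** 2 / expected for c in alphabet)

print(chi_square_uniform("ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"))
print(chi_square_uniform("HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"))
```

A larger value means the letter counts sit further from uniform. With 26 letters the statistic is usually compared against a chi-square distribution with 25 degrees of freedom, but for texts as short as these the approximation is poor, so treat the result as a rough indicator only.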
 
doro said:
A computer prints two papers,

on the first paper is written, "ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"

on the second paper is written, "HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"

One might say the first paper has some random letters which are generated by a computer randomly. But he might say the second paper is typed by a human because "HOW ARE YOU" is not made of random letters...

But how we can describe it in mathematics? How we can prove that the second paper most probably is typed by a human beign?

Nobody can prove it.
 
The answer depends on the size of the pattern found, the paper, the punctuation (i.e. with or without caps, alphanumeric characters, symbols, etc.), and other factors.

For example:

What is the probability of any one combination, out of the possible number of combinations of 20 letter English sentences (approx. 500,000 words in English),

95 367 431 640 625 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000.

appearing on 1 page, selected from an infinite random text?

Where did your gibberish text come from? How was it randomized?
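
As a rough check on the size of that number: its leading digits, 95 367 431 640 625, equal 5^20, which suggests it was computed as 500,000^20, i.e. sequences of 20 words drawn from a 500,000-word vocabulary. That reading is an assumption, and the short sketch below only redoes the arithmetic under it.

```python
vocabulary_size = 500_000   # approximate word count quoted in the post
sentence_length = 20        # assumed to mean 20 words, not 20 letters

combinations = vocabulary_size ** sentence_length
print(len(str(combinations)))           # number of decimal digits
print(combinations == 5**20 * 10**100)  # 500,000 = 5 * 10**5
print(1 / combinations)                 # chance of one specific sequence
```

Under a uniform model every specific 20-word sequence, meaningful or not, has this same tiny probability, so the number by itself does not separate human text from random text.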
 
  • #10
Perhaps if there are misspellings like "beign" for "being", we would be more likely to think it was human generated.
 
  • #11
They can be introduced for a purpose.

As I wrote earlier, this is not much different from a Turing test. Please visit http://www.loebner.net/Prizef/loebner-prize.html and read about the competition, as well as transcripts from previous editions.
 
  • #12
This is probably not what the author is asking, but one way to see whether a text is random or not is through information theory. Entropy (developed by the great mathematician Shannon) gives us an answer to whether a text contains much information or not. It will be expected that a random text will not contain much information, while a genuine text will contain much information. Thus the entropy can be a way to discriminate between random texts and genuine text.
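
A simple way to estimate entropy from a text is to use single-character frequencies, as in the sketch below. The function name `entropy_per_char` is made up, and looking only at single characters is a simplifying assumption; it ignores all dependence between neighbouring letters.

```python
import math
from collections import Counter

def entropy_per_char(text):
    """Empirical Shannon entropy in bits per character, from single-character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(entropy_per_char("ASDDSF DFGDF ERSDKUF DFD RRVFDSDA"))
print(entropy_per_char("HOW ARE YOU HSHJFD SDF JGYJHNF LKJSFL"))
```

Note that under this particular estimate, letters drawn uniformly at random score close to the maximum of log2(26), about 4.7 bits per letter, while ordinary English scores lower because its letter frequencies are uneven. So any discrimination would rest on how far a text falls below the maximum.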
 
  • #13
micromass said:
It will be expected that a random text will not contain much information, while a genuine text will contain much information. Thus the entropy can be a way to discriminate between random texts and genuine text.

Shannon information is a measure of entropy. Entropy is a measure of the number of states in which a system can exist. How many ways can you arrange n characters (including spaces, only lower case letters, and punctuation) in a string of length k? Say there are about n = 30 characters in total and a message of length k = 142. Each character has p = 1/30, so the number of possible states of the system is [tex]r = 30^{142}[/tex] and the probability of any given state is p = 1/r. The Gibbs entropy (H) is:

[tex]H(X)=-\sum^{r}_{i=1}p_{i} \log_{2} p_{i}[/tex]

This is essentially the same equation as for Shannon Information. What Information Theory is about is the preservation of a string through a communication channel. The string itself can be anything. Meaning is imputed by the users. What's important is preservation of the particular string. It's obvious that the probability of a particular string is very low, corresponding to high entropy and high information content.
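
For the numbers used in this post (30 characters, message length 142, all strings equally likely), the entropy sum collapses to log2 of the number of states, since every term is (1/r) log2(1/r). A quick sketch of that arithmetic, with variable names chosen for this example:

```python
import math

n = 30    # alphabet size used in the post
k = 142   # message length used in the post

# For r = n**k equally likely strings, -sum(p * log2(p)) collapses to log2(r).
r = n ** k
entropy_bits = k * math.log2(n)
print(entropy_bits)

# Cross-check against log2 of the exact integer r.
print(math.log2(r))
```

That comes to roughly 697 bits for the whole message, or log2(30), about 4.9 bits, per character, which is the maximum possible for this alphabet.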

http://homepages.cwi.nl/~paulv/papers/info.pdf
 