I am implementing some unsupervised content-selection (extraction-based) methods for document summarization, and I'm confused about what my textbook calls the "log-likelihood ratio". The book briefly describes it as follows:

"The LLR for a word, generally called lambda(w), is the ratio between the probability of observing w in both the input and in the background corpus assuming equal probabilities in both corpora, and the probability of observing w in both assuming different probabilities for w in the input and the background corpus."

Breaking that down, we have the numerator: "the probability of observing w in both the input and in the background corpus assuming equal probabilities in both corpora". How do I calculate what probability to use here?

And the denominator: "the probability of observing w in both assuming different probabilities for w in the input and the background corpus". Is this as simple as the probability of the word occurring in the input times the probability of the word occurring in the corpus? For example:

(count(word,input) / total words in input) * (count(word,corpus) / total words in corpus)
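
For comparison, the binomial form used in Dunning (1993) can be written as below. This is a sketch of one common reading, not the textbook's own notation: the symbols k1, n1 (count of w and total words in the input), k2, n2 (the same for the background corpus), and L are introduced here for illustration. Under this reading, the "equal probabilities" case uses a single pooled estimate p, and each "probability of observing w" is a binomial likelihood of the observed count rather than the relative frequency itself.

```latex
\lambda(w) = \frac{L(k_1; n_1, p)\, L(k_2; n_2, p)}
                  {L(k_1; n_1, p_1)\, L(k_2; n_2, p_2)},
\qquad
L(k; n, q) = \binom{n}{k} q^{k} (1 - q)^{n - k}

p = \frac{k_1 + k_2}{n_1 + n_2}, \qquad
p_1 = \frac{k_1}{n_1}, \qquad
p_2 = \frac{k_2}{n_2}
```

The binomial coefficients cancel in the ratio, and the quantity -2 log lambda(w) is asymptotically chi-square distributed with one degree of freedom, which is what makes it usable as a significance test.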

I've been looking over a paper my book references, Accurate Methods for the Statistics of Surprise and Coincidence (Dunning 1993), but I'm finding it difficult to relate it to the problem of calculating LLR values for individual words in extraction-based summarization. Any clarification here would be really appreciated.
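
To make the question concrete, here is a minimal Python sketch of how the per-word statistic could be computed under the binomial assumption from Dunning (1993). The function names (`log_binomial_likelihood`, `llr`) and the argument names (`k1`, `n1`, `k2`, `n2`) are hypothetical, chosen for illustration. It assumes maximum-likelihood estimates for all probabilities and non-empty input and background corpora.

```python
import math


def log_binomial_likelihood(k, n, p):
    """Log of p^k * (1 - p)^(n - k).

    The binomial coefficient is omitted because it cancels in the ratio.
    Terms with a zero exponent are skipped so log(0) is never evaluated
    when p is a maximum-likelihood estimate.
    """
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def llr(k1, n1, k2, n2):
    """Return -2 * log(lambda) for one word.

    k1, n1: count of the word and total words in the input (assumed names).
    k2, n2: count of the word and total words in the background corpus.
    """
    p = (k1 + k2) / (n1 + n2)  # pooled estimate: same probability in both
    p1 = k1 / n1               # separate estimate for the input
    p2 = k2 / n2               # separate estimate for the background
    log_lambda = (
        log_binomial_likelihood(k1, n1, p)
        + log_binomial_likelihood(k2, n2, p)
        - log_binomial_likelihood(k1, n1, p1)
        - log_binomial_likelihood(k2, n2, p2)
    )
    return -2.0 * log_lambda


if __name__ == "__main__":
    # A word ten times more frequent in the input than in the background.
    print(llr(10, 100, 10, 1000))
    # A word with the same relative frequency in both corpora.
    print(llr(10, 100, 100, 1000))
```

The returned value can be compared against a chi-square critical value with one degree of freedom (for example 10.83 for a 0.001 significance level) to decide whether a word counts as unusually frequent in the input. Note that the statistic is also large for words that are unusually rare in the input, so a check that `k1 / n1 > k2 / n2` may be wanted as well.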