Question about Wikipedia statistics page

Borg · Apr 7, 2013

I have a question about an entry on the Wikipedia page for tf-idf (term frequency–inverse document frequency). In the Mathematical Details section, the inverse document frequency is described as follows:

idf(t, D) = log \frac{|D|}{|{d \in D : t \in d}|}

This is my first attempt at Latex so I couldn't figure out how to put the {} symbols around the lower term.

It states that it is common to adjust the lower term (the denominator) as follows to avoid division by zero:

|{d \in D : t \in d}| → 1 + |{d \in D : t \in d}|

The article then goes on to state that "the ratio inside the idf's log function is always greater than or equal to 1". However, if a word appears in every document in the set D, the adjusted ratio is |D| / (1 + |D|), which is less than 1, so the log is negative. Shouldn't |D| also be increased by 1 so that idf(t, D) is always >= 0?

I'm thinking that the final formula should look like this:

idf(t, D) = log \frac{1 + |D|}{1 + |{d \in D : t \in d}|}
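The three variants discussed above can be compared numerically. The following is a minimal sketch, not a reference implementation: the function names and the sample counts (10 documents) are made up for illustration, and `n_containing` stands for |{d in D : t in d}|.

```python
import math

def idf_plain(n_docs, n_containing):
    # Unadjusted form: log(|D| / df). Fails when df == 0.
    return math.log(n_docs / n_containing)

def idf_denominator_adjusted(n_docs, n_containing):
    # Adjustment quoted from the article: add 1 to the denominator only.
    return math.log(n_docs / (1 + n_containing))

def idf_both_adjusted(n_docs, n_containing):
    # Form proposed in the post: add 1 to numerator and denominator.
    return math.log((1 + n_docs) / (1 + n_containing))

n_docs = 10  # hypothetical corpus size
for df in (0, 5, 10):
    print(df,
          idf_denominator_adjusted(n_docs, df),
          idf_both_adjusted(n_docs, df))
```

With the denominator-only adjustment, a term found in all 10 documents gives log(10/11), which is slightly negative. Adding 1 to both terms gives log(11/11) = 0 for that case and keeps the ratio at or above 1 whenever the count does not exceed |D|.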

AlephZero · Apr 7, 2013

Borg said:

This is my first attempt at Latex so I couldn't figure out how to put the {} symbols around the lower term.

To get the characters { and }, write \{ and \}.
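Applied to the idf formula from the first post, the escaped braces would look roughly like this (a sketch; `\log` is used in place of the plain `log` only so that the function name is typeset upright):

```latex
idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}
```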

However, if a word appears in every document in the set D, you end up with a log that ends up negative.

True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.

Borg · Apr 7, 2013

AlephZero said:

To get the characters { and }, write \{ and \}.

True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.

Ah, I was looking at it wrong. I get a little mathematically dyslexic sometimes. I looked at the original case where the top and bottom were the same and saw that the log of 1 is 0. I then misinterpreted the statement that the bottom is always greater than one to mean that the log would be greater than 0 for the new equation. Not true. Thanks!

AlephZero · Apr 7, 2013

Actually, avoiding division by 0 or "log(0)" is probably irrelevant on computers using IEEE-standard floating point arithmetic, since there is a special representation of "infinity" that has the properties you would expect it to have and it doesn't generate any arithmetic errors (e.g. 1.0 / 0.0 = "infinity", "infinity" compares greater than any finite number, log(0.0) = -"infinity", etc.)
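Whether the infinities show up automatically depends on the language sitting on top of the floating-point hardware. As a small sketch in Python (standard library only): the infinity value itself behaves as described, but Python's own `/` operator and `math.log` raise exceptions for these inputs instead of returning an infinity. The helper `idf_ieee_style` is hypothetical and only mimics the IEEE-style result by hand.

```python
import math

# The IEEE infinity orders above every finite float and propagates through log.
assert math.inf > 1.7976931348623157e308
assert math.log(math.inf) == math.inf

def raises(func, exc_type):
    # Return True if calling func() raises exc_type.
    try:
        func()
    except exc_type:
        return True
    return False

# Plain Python floats raise instead of yielding +/- infinity here.
division_raises = raises(lambda: 1.0 / 0.0, ZeroDivisionError)
log_raises = raises(lambda: math.log(0.0), ValueError)

def idf_ieee_style(n_docs, n_containing):
    # Hypothetical helper: return +inf when no document contains the term,
    # which is what log(|D| / 0) would give under IEEE semantics.
    if n_containing == 0:
        return math.inf
    return math.log(n_docs / n_containing)
```

So in an environment like this one, some guard (or the +1 adjustment) is still needed to keep the program from stopping on a term with zero document frequency.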
