Question about Wikipedia statistics page

Borg
Science Advisor
Gold Member
Messages
2,265
Reaction score
4,860
I have a question about an entry on the following Wikipedia page - tf-idf ( term frequency–inverse document frequency). In the Matematical Details section where they are describing the inverse document frequency as follows:

idf(t, D) = log \frac{|D|}{|{d \in D : t \in d}|}

This is my first attempt at Latex so I couldn't figure out how to put the {} symbols around the lower term.

It states that it is common to adjust the the lower term as follows to avoid division by zero:

|{d \in D : t \in d}| → 1 + |{d \in D : t \in d}|

The article then continues on and states "the ratio inside the idf's log function is always greater than or equal to 1". However, if a word appears in every document in the set D, you end up with a log that ends up negative. Shouldn't the value of D be increased by 1 in order for idf (t, D) to always be >= 0?

I'm thinking that the final formula should look like this:

idf(t, D) = log \frac{1 + |D|}{1 + |{d \in D : t \in d}|}
 
Physics news on Phys.org
Borg said:
This is my first attempt at Latex so I couldn't figure out how to put the {} symbols around the lower term.

To get the characters { and }, write \{ and \}.

However, if a word appears in every document in the set D, you end up with a log that ends up negative.

True, but since a word that appears in every document is useless for searching (and so it a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.
 
AlephZero said:
To get the characters { and }, write \{ and \}.



True, but since a word that appears in every document is useless for searching (and so it a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.
Ah, I was looking at it wrong. I get a little mathematically dislexic sometimes. I looked at the original case where the top and bottom were the same and saw that the log of 1 is 0. I then misinterpreted the statement that the bottom is always greater than one, to be the same as saying that the log would be greater than 0 for the new equation. Not true. Thanks!
 
Actually, avoiding division by 0 or "log(0)" is probably irrelevant on computers using IEEE-standard floating point arithmetic, since there is a special representation of "infinity" that has the properties you would expect it to have and it doesn't generate any arithmetic errors (e.g. 1.0 / 0.0 = "infinity", "infinity" compares greater than any finite number, log(0.0) = -"infinity", etc.)
 
Hi all, I've been a roulette player for more than 10 years (although I took time off here and there) and it's only now that I'm trying to understand the physics of the game. Basically my strategy in roulette is to divide the wheel roughly into two halves (let's call them A and B). My theory is that in roulette there will invariably be variance. In other words, if A comes up 5 times in a row, B will be due to come up soon. However I have been proven wrong many times, and I have seen some...
Thread 'Detail of Diagonalization Lemma'
The following is more or less taken from page 6 of C. Smorynski's "Self-Reference and Modal Logic". (Springer, 1985) (I couldn't get raised brackets to indicate codification (Gödel numbering), so I use a box. The overline is assigning a name. The detail I would like clarification on is in the second step in the last line, where we have an m-overlined, and we substitute the expression for m. Are we saying that the name of a coded term is the same as the coded term? Thanks in advance.
Back
Top