Question about Wikipedia statistics page

Borg · Apr 7, 2013

I have a question about an entry on the Wikipedia page for tf-idf (term frequency–inverse document frequency). In the Mathematical details section, the inverse document frequency is defined as follows:

idf(t, D) = log [itex]\frac{|D|}{|{d \in D : t \in d}|}[/itex]

This is my first attempt at Latex so I couldn't figure out how to put the {} symbols around the lower term.

It states that it is common to adjust the lower term as follows to avoid division by zero:

|{d [itex]\in[/itex] D : t [itex]\in[/itex] d}| → 1 + |{d [itex]\in[/itex] D : t [itex]\in[/itex] d}|

The article then goes on to state that "the ratio inside the idf's log function is always greater than or equal to 1". However, if a word appears in every document in the set D, the adjusted ratio is |D| / (1 + |D|), which is less than 1, so the log ends up negative. Shouldn't |D| also be increased by 1 in order for idf(t, D) to always be >= 0?

I'm thinking that the final formula should look like this:

idf(t, D) = log [itex]\frac{1 + |D|}{1 + |{d \in D : t \in d}|}[/itex]
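To make the comparison concrete, here is a minimal Python sketch of the three variants discussed above. The function names and the toy collection size of 10 documents are my own assumptions for illustration, not anything taken from the Wikipedia article.

```python
import math

def idf_plain(n_docs, n_containing):
    # log(|D| / df); fails with a zero denominator when df == 0
    return math.log(n_docs / n_containing)

def idf_denominator_plus_one(n_docs, n_containing):
    # Adjustment quoted from the article: only the lower term gets +1
    return math.log(n_docs / (1 + n_containing))

def idf_both_plus_one(n_docs, n_containing):
    # Variant proposed in this post: +1 on both the top and the bottom
    return math.log((1 + n_docs) / (1 + n_containing))

n_docs = 10  # hypothetical collection size
for df in (0, 1, 5, 10):
    print(df,
          idf_denominator_plus_one(n_docs, df),
          idf_both_plus_one(n_docs, df))
```

With df equal to the number of documents, the denominator-only variant gives a negative value, while the variant with 1 added to both terms gives exactly 0 and never goes below it.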

AlephZero · Apr 7, 2013

Borg said:

This is my first attempt at Latex so I couldn't figure out how to put the {} symbols around the lower term.

To get the characters { and }, write \{ and \}.

However, if a word appears in every document in the set D, you end up with a log that ends up negative.

True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.

Borg · Apr 7, 2013

AlephZero said:

To get the characters { and }, write \{ and \}.

True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.

Ah, I was looking at it wrong. I get a little mathematically dyslexic sometimes. I looked at the original case where the top and bottom were the same and saw that the log of 1 is 0. I then misinterpreted the statement that the bottom is always greater than one to be the same as saying that the log would be greater than 0 for the new equation. Not true. Thanks!

AlephZero · Apr 7, 2013

Actually, avoiding division by 0 or "log(0)" is probably irrelevant on computers using IEEE-standard floating point arithmetic, since there is a special representation of "infinity" that has the properties you would expect it to have and it doesn't generate any arithmetic errors (e.g. 1.0 / 0.0 = "infinity", "infinity" compares greater than any finite number, log(0.0) = -"infinity", etc.)
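Whether you actually get this behaviour depends on the language, so it is worth checking. As a hedged illustration in Python: its floats are IEEE doubles and `math.inf` behaves as described, but the built-in `1.0 / 0.0` raises `ZeroDivisionError` and `math.log(0.0)` raises `ValueError` rather than returning an infinity. The two helpers below are hypothetical names of my own that mimic the IEEE results for the idf case.

```python
import math

def ieee_ratio(n_docs, n_containing):
    # Mimics IEEE division for a positive numerator: x / 0 -> +infinity.
    # Plain Python would raise ZeroDivisionError here instead.
    if n_containing == 0:
        return math.inf
    return n_docs / n_containing

def ieee_log(x):
    # Mimics the IEEE-style result log(0) -> -infinity.
    # Plain math.log(0.0) raises ValueError instead.
    if x == 0.0:
        return -math.inf
    return math.log(x)

# Infinity compares greater than any finite number.
print(math.inf > 1e308)

# A term that appears in no documents gets an infinite idf, with no exception.
print(ieee_log(ieee_ratio(10, 0)))
```

The infinite idf for an unseen term is harmless only if nothing downstream multiplies it by a zero term frequency, since infinity times zero is NaN.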

blue_raver22 · Apr 14, 2013

Thank you for bringing this to my attention. I always appreciate when individuals take the time to critically examine and question information presented to them.

In this case, I believe your suggestion to add 1 to |D| as well as to the lower term is valid. As you mentioned, if a word appears in every document in the set D and only the lower term is increased by 1, the ratio inside the log function is less than 1 and the log is negative. Increasing both |D| and the lower term by 1 ensures that the result of the log function is always greater than or equal to 0.

I would also like to add that the purpose of the inverse document frequency is to give a higher weight to terms that are rare in a document collection. Adding 1 to both terms preserves that ordering: the fewer documents a term appears in, the larger its idf.

However, I would suggest double-checking with other sources or experts in the field to confirm that this adjustment is commonly used and accepted. It is always important to validate information from multiple sources to ensure accuracy.

Thank you again for bringing this to my attention and for your critical thinking. I hope this helps clarify the issue for you.
