Question about Wikipedia statistics page

  • Context: Graduate
  • Thread starter: Borg
  • Tags: Statistics, Wikipedia

Discussion Overview

The discussion revolves around the mathematical formulation of the inverse document frequency (idf) in the context of tf-idf (term frequency–inverse document frequency) as presented on a Wikipedia page. Participants explore the implications of adjusting the formula to avoid division by zero and the behavior of the logarithmic function in this context.

Discussion Character

  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant questions whether the formula for idf should be adjusted to ensure it is always non-negative, suggesting that 1 should be added to the numerator |D| as well as to the denominator.
  • Another participant acknowledges that a word appearing in every document is not useful for searching, implying that the specific behavior of the idf in such cases may not be critical.
  • A later reply reflects on a misunderstanding regarding the logarithmic properties, clarifying that the log of 1 is 0 and that the earlier interpretation of the formula was incorrect.
  • Another participant introduces the idea that division by zero or log(0) may not be a concern in practical applications due to the properties of IEEE-standard floating point arithmetic.

Areas of Agreement / Disagreement

Participants express differing views on the necessity of adjusting the idf formula and the implications of logarithmic behavior, indicating that multiple competing perspectives remain unresolved.

Contextual Notes

There are unresolved assumptions regarding the mathematical properties of logarithms and the practical implications of using certain adjustments in the idf formula.

Borg
I have a question about an entry on the following Wikipedia page: tf-idf (term frequency–inverse document frequency). In the Mathematical Details section, the inverse document frequency is described as follows:

idf(t, D) = log [itex]\frac{|D|}{|{d \in D : t \in d}|}[/itex]

This is my first attempt at Latex so I couldn't figure out how to put the {} symbols around the lower term.

It states that it is common to adjust the lower term as follows to avoid division by zero:

|{d [itex]\in[/itex] D : t [itex]\in[/itex] d}| → 1 + |{d [itex]\in[/itex] D : t [itex]\in[/itex] d}|

The article then goes on to state that "the ratio inside the idf's log function is always greater than or equal to 1". However, if a word appears in every document in the set D, you end up with a log that ends up negative. In that case the adjusted ratio is |D| / (1 + |D|), which is less than 1. Shouldn't |D| also be increased by 1 in order for idf(t, D) to always be >= 0?

I'm thinking that the final formula should look like this:

idf(t, D) = log [itex]\frac{1 + |D|}{1 + |{d \in D : t \in d}|}[/itex]
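
To make the comparison concrete, here is a minimal Python sketch of the three variants discussed in this post. The function names and the toy corpus size are assumptions chosen for illustration, not anything taken from the Wikipedia article, and `n_docs` and `doc_freq` stand for |D| and the number of documents containing t.

```python
import math

def idf_plain(n_docs, doc_freq):
    """Unadjusted form: log(|D| / df). Fails when doc_freq is 0."""
    return math.log(n_docs / doc_freq)

def idf_denominator_only(n_docs, doc_freq):
    """Adjusted form quoted from the article: log(|D| / (1 + df))."""
    return math.log(n_docs / (1 + doc_freq))

def idf_both(n_docs, doc_freq):
    """Form proposed in this post: log((1 + |D|) / (1 + df))."""
    return math.log((1 + n_docs) / (1 + doc_freq))

n_docs = 4  # hypothetical corpus of four documents
for doc_freq in (1, 2, 4):
    print(doc_freq,
          idf_plain(n_docs, doc_freq),
          idf_denominator_only(n_docs, doc_freq),
          idf_both(n_docs, doc_freq))

# A term that appears in no document is only safe with the adjusted forms.
print(0, idf_denominator_only(n_docs, 0), idf_both(n_docs, 0))
```

With a term that appears in all four documents, the denominator-only form gives log(4/5), which is negative, while the other two give log(1) = 0.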
 
Borg said:
This is my first attempt at Latex so I couldn't figure out how to put the {} symbols around the lower term.

To get the characters { and }, write \{ and \}.

However, if a word appears in every document in the set D, you end up with a log that ends up negative.

True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.
 
AlephZero said:
To get the characters { and }, write \{ and \}.



True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.

Ah, I was looking at it wrong. I get a little mathematically dyslexic sometimes. I looked at the original case where the top and bottom were the same and saw that the log of 1 is 0. I then misinterpreted the statement that the bottom is always greater than one as meaning that the log would be greater than 0 for the new equation. Not true. Thanks!
 
Actually, avoiding division by 0 or "log(0)" is probably irrelevant on computers using IEEE-standard floating point arithmetic, since there is a special representation of "infinity" that has the properties you would expect it to have and it doesn't generate any arithmetic errors (e.g. 1.0 / 0.0 = "infinity", "infinity" compares greater than any finite number, log(0.0) = -"infinity", etc.)
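
As a rough illustration of the properties listed above, here is a small Python sketch. One caveat: whether you actually get these values depends on the language. Python's built-in float division and `math.log` raise exceptions instead of returning infinities, so the helper `idf_ieee_style` (a name made up for this example) returns `math.inf` explicitly to mimic what IEEE arithmetic would produce.

```python
import math

def idf_ieee_style(n_docs, doc_freq):
    """Unadjusted idf that yields +infinity for a term found in no document,
    mimicking log(n_docs / 0.0) under IEEE rules."""
    if doc_freq == 0:
        return math.inf
    return math.log(n_docs / doc_freq)

def python_traps():
    """Report whether Python raises on 1.0 / 0.0 and on log(0.0)."""
    zero = 0.0
    try:
        1.0 / zero
        division_raises = False
    except ZeroDivisionError:
        division_raises = True
    try:
        math.log(zero)
        log_raises = False
    except ValueError:
        log_raises = True
    return division_raises, log_raises

# Infinity behaves as described: it compares greater than any finite number.
print(math.inf > 1e308)        # True
print(-math.inf < -1e308)      # True
print(idf_ieee_style(10, 0))   # inf
print(python_traps())          # (True, True)
```

So the special values are well behaved once you have them, but in a language that traps these operations the adjustment (or an explicit check) is still what keeps the program from crashing.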
 
