Question about Wikipedia statistics page

  1. Apr 7, 2013 #1

    Borg

    Gold Member

    I have a question about an entry on the Wikipedia page for tf-idf (term frequency–inverse document frequency). In the Mathematical Details section, the inverse document frequency is described as follows:

    idf(t, D) = log [itex]\frac{|D|}{|{d \in D : t \in d}|}[/itex]

    This is my first attempt at LaTeX, so I couldn't figure out how to put the {} symbols around the lower term.

    It states that it is common to adjust the lower term as follows to avoid division by zero:

    |{d [itex]\in[/itex] D : t [itex]\in[/itex] d}| → 1 + |{d [itex]\in[/itex] D : t [itex]\in[/itex] d}|

    The article then goes on to state that "the ratio inside the idf's log function is always greater than or equal to 1". However, if a word appears in every document in the set D, the adjusted ratio is |D| / (1 + |D|), which is less than 1, so the log ends up negative. Shouldn't |D| also be increased by 1 in order for idf(t, D) to always be >= 0?

    I'm thinking that the final formula should look like this:

    idf(t, D) = log [itex]\frac{1 + |D|}{1 + |{d \in D : t \in d}|}[/itex]
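
    A quick numerical check of the three variants, as a minimal sketch. The function names and the toy corpus size of 10 documents are made up for illustration, and the natural log is assumed (a different base only rescales the values, it does not change their sign):

```python
import math

def idf_plain(n_docs, doc_freq):
    # log(|D| / df); fails when the term appears in no document (df == 0)
    return math.log(n_docs / doc_freq)

def idf_adjusted_denominator(n_docs, doc_freq):
    # log(|D| / (1 + df)); the adjustment described in the article
    return math.log(n_docs / (1 + doc_freq))

def idf_adjusted_both(n_docs, doc_freq):
    # log((1 + |D|) / (1 + df)); the variant proposed in this post
    return math.log((1 + n_docs) / (1 + doc_freq))

n_docs = 10  # hypothetical corpus size
for doc_freq in (0, 1, 5, 10):
    plain = idf_plain(n_docs, doc_freq) if doc_freq > 0 else None
    print(doc_freq, plain,
          idf_adjusted_denominator(n_docs, doc_freq),
          idf_adjusted_both(n_docs, doc_freq))
```

    With doc_freq equal to n_docs, the denominator-only adjustment gives log(10/11), which is negative, while adding 1 to both terms gives log(11/11) = 0.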
     
  3. Apr 7, 2013 #2

    AlephZero

    Science Advisor
    Homework Helper

    To get the characters { and }, write \{ and \}.

    True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.
     
  4. Apr 7, 2013 #3

    Borg

    Gold Member

    Ah, I was looking at it wrong. I get a little mathematically dyslexic sometimes. I looked at the original case where the top and bottom were the same and saw that the log of 1 is 0. I then misinterpreted the statement that the bottom is always greater than one to be the same as saying that the log would be greater than 0 for the new equation. Not true. Thanks!
     
  5. Apr 7, 2013 #4

    AlephZero

    Science Advisor
    Homework Helper

    Actually, avoiding division by 0 or "log(0)" is probably irrelevant on computers using IEEE-standard floating point arithmetic, since there is a special representation of "infinity" that has the properties you would expect it to have and it doesn't generate any arithmetic errors (e.g. 1.0 / 0.0 = "infinity", "infinity" compares greater than any finite number, log(0.0) = -"infinity", etc.)
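
    Whether you actually see this behaviour depends on the language sitting on top of the hardware. As a rough sketch in Python: the infinity value itself behaves as described, but plain Python raises ZeroDivisionError for 1.0 / 0.0 and ValueError for math.log(0.0) instead of returning infinities. The helper idf_ieee below is hypothetical and only mimics the IEEE result by hand:

```python
import math

inf = math.inf  # the IEEE "infinity" value

# Infinity compares greater than any finite float and absorbs finite additions.
print(inf > 1.0e308)       # True
print(inf + 1.0 == inf)    # True
print(math.log(inf))       # inf

def idf_ieee(n_docs, doc_freq):
    # Hypothetical helper: treat |D| / 0 as +infinity, as IEEE division would,
    # because Python's own float division raises ZeroDivisionError instead.
    ratio = math.inf if doc_freq == 0 else n_docs / doc_freq
    return math.log(ratio)

print(idf_ieee(10, 0))     # inf
print(idf_ieee(10, 10))    # 0.0
```

    A term that appears in no documents then gets an idf of infinity rather than crashing the program, which is the behaviour the IEEE rules would give directly in a language that exposes them.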
     