I have a question about an entry on the following Wikipedia page: tf-idf (term frequency–inverse document frequency). In the Mathematical Details section, the inverse document frequency is described as follows:

idf(t, D) = log [itex]\frac{|D|}{|\{d \in D : t \in d\}|}[/itex]

This is my first attempt at LaTeX.

It states that it is common to adjust the lower term as follows to avoid division by zero:

[itex]|\{d \in D : t \in d\}| \to 1 + |\{d \in D : t \in d\}|[/itex]

The article then goes on to state that "the ratio inside the idf's log function is always greater than or equal to 1". However, if a word appears in every document in the set D, the adjusted ratio becomes |D| / (1 + |D|), which is less than 1, so the log is negative. Shouldn't |D| also be increased by 1 so that idf(t, D) is always >= 0?

I'm thinking that the final formula should look like this:

idf(t, D) = log [itex]\frac{1 + |D|}{1 + |\{d \in D : t \in d\}|}[/itex]
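
To sanity-check this, here is a minimal Python sketch comparing the denominator-only adjustment with the version that adds 1 to both terms. The function names, the toy corpus, and the use of the natural log are my own assumptions for illustration; the article does not specify any of them.

```python
import math


def document_count(term, docs):
    """Number of documents in docs that contain term, i.e. |{d in D : t in d}|."""
    return sum(1 for d in docs if term in d)


def idf_denominator_plus_one(n_docs, n_containing):
    """Adjustment described in the article: add 1 to the lower term only."""
    return math.log(n_docs / (1 + n_containing))


def idf_both_plus_one(n_docs, n_containing):
    """Proposed variant: add 1 to both |D| and the lower term."""
    return math.log((1 + n_docs) / (1 + n_containing))


# Toy corpus (hypothetical): "the" appears in every document,
# "cat" in one document, "zebra" in none.
docs = [
    {"the", "cat"},
    {"the", "dog"},
    {"the", "bird"},
    {"the", "fish"},
]

for term in ("the", "cat", "zebra"):
    n = document_count(term, docs)
    print(
        term,
        n,
        idf_denominator_plus_one(len(docs), n),
        idf_both_plus_one(len(docs), n),
    )
```

With 4 documents and a term that appears in all 4, the denominator-only version gives log(4/5), which is about -0.223, while the version with 1 added to both terms gives log(5/5) = 0. Since the document count can never exceed |D|, the second ratio is always at least 1.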

**Physics Forums - The Fusion of Science and Community**

# Question about Wikipedia statistics page
