Question about Wikipedia statistics page

In summary, the conversation discusses the mathematical details of inverse document frequency (idf) in the context of tf-idf (term frequency-inverse document frequency). It is common to add 1 to the denominator in order to avoid division by zero, and the article states that the ratio inside the log function is always greater than or equal to 1. However, with that adjustment, a word that appears in every document in the set D gives a ratio of |D| / (1 + |D|), which is less than 1, so the log is negative. The proposed solution is to add 1 to |D| in the numerator as well, so that idf(t, D) is always greater than or equal to 0. It is noted that this may not be necessary, since a word that appears in every document is not useful for searching.
  • #1
Borg
I have a question about an entry on the following Wikipedia page - tf-idf (term frequency–inverse document frequency). In the Mathematical Details section, they describe the inverse document frequency as follows:

idf(t, D) = log [itex]\frac{|D|}{|{d \in D : t \in d}|}[/itex]

This is my first attempt at LaTeX so I couldn't figure out how to put the {} symbols around the lower term.

It states that it is common to adjust the lower term as follows to avoid division by zero:

|{d [itex]\in[/itex] D : t [itex]\in[/itex] d}| → 1 + |{d [itex]\in[/itex] D : t [itex]\in[/itex] d}|

The article then continues on and states "the ratio inside the idf's log function is always greater than or equal to 1". However, if a word appears in every document in the set D, you end up with a log that ends up negative. Shouldn't |D| in the numerator also be increased by 1 in order for idf(t, D) to always be >= 0?

I'm thinking that the final formula should look like this:

idf(t, D) = log [itex]\frac{1 + |D|}{1 + |{d \in D : t \in d}|}[/itex]
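For anyone who wants to check the numbers, here is a minimal Python sketch of the three variants above. The toy corpus and the function names are made up for illustration and do not come from the Wikipedia article.

```python
import math

def doc_freq(term, docs):
    """Number of documents that contain the term: |{d in D : t in d}|."""
    return sum(1 for d in docs if term in d)

def idf_plain(term, docs):
    # log(|D| / df); raises ZeroDivisionError when the term is in no document
    return math.log(len(docs) / doc_freq(term, docs))

def idf_denominator_plus_one(term, docs):
    # log(|D| / (1 + df)); the adjustment quoted from the article
    return math.log(len(docs) / (1 + doc_freq(term, docs)))

def idf_both_plus_one(term, docs):
    # log((1 + |D|) / (1 + df)); the formula proposed in this post
    return math.log((1 + len(docs)) / (1 + doc_freq(term, docs)))

# Hypothetical toy corpus: each document is a set of words.
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "cat", "ran"},
]

for term in ("the", "cat", "dog", "bird"):
    df = doc_freq(term, docs)
    plain = idf_plain(term, docs) if df > 0 else None  # undefined for df == 0
    print(term, df, plain,
          idf_denominator_plus_one(term, docs),
          idf_both_plus_one(term, docs))
```

With this corpus, "the" appears in all three documents: the plain formula gives log(3/3) = 0, the denominator-only adjustment gives log(3/4), which is negative, and adding 1 to both terms gives log(4/4) = 0 again.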
 
  • #2
Borg said:
This is my first attempt at LaTeX so I couldn't figure out how to put the {} symbols around the lower term.

To get the characters { and }, write \{ and \}.

However, if a word appears in every document in the set D, you end up with a log that ends up negative.

True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.
 
  • #3
AlephZero said:
To get the characters { and }, write \{ and \}.



True, but since a word that appears in every document is useless for searching (and so is a word that appears in no documents), it probably doesn't matter much what you do so long as the computer program doesn't crash.

Ah, I was looking at it wrong. I get a little mathematically dyslexic sometimes. I looked at the original case where the top and bottom were the same and saw that the log of 1 is 0. I then misinterpreted the statement that the bottom is always greater than one, to be the same as saying that the log would be greater than 0 for the new equation. Not true. Thanks!
 
  • #4
Actually, avoiding division by 0 or "log(0)" is probably irrelevant on computers using IEEE-standard floating point arithmetic, since there is a special representation of "infinity" that has the properties you would expect it to have and it doesn't generate any arithmetic errors (e.g. 1.0 / 0.0 = "infinity", "infinity" compares greater than any finite number, log(0.0) = -"infinity", etc.)
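A small Python sketch of this point. One caveat is an observation about Python rather than something stated in the thread: the IEEE infinity values exist as floats, but the language raises an exception for 1.0 / 0.0 and math.log(0.0) instead of returning them, so whether the guard is needed can depend on the language and library. The helper idf_ieee_style is hypothetical and only mimics what the IEEE rules would produce.

```python
import math

# IEEE 754 infinity is available as an ordinary float.
print(math.inf > 1e308)        # infinity compares greater than any finite float
print(math.log(math.inf))      # log(inf) is inf

# Plain Python raises here instead of returning inf / -inf.
try:
    1.0 / 0.0
except ZeroDivisionError as exc:
    print("division raised:", exc)

try:
    math.log(0.0)
except ValueError as exc:
    print("log raised:", exc)

def idf_ieee_style(n_docs, df):
    """Unadjusted idf, returning +inf when df == 0 as IEEE arithmetic would."""
    if df == 0:
        return math.inf            # stands in for log(n_docs / 0.0)
    return math.log(n_docs / df)

print(idf_ieee_style(3, 0))        # inf
print(idf_ieee_style(3, 3))        # 0.0

# An infinite idf can still cause trouble downstream: 0 * inf is NaN.
print(math.isnan(0.0 * math.inf))  # True
```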
 
  • #5


Thank you for bringing this to my attention. I always appreciate when individuals take the time to critically examine and question information presented to them.

In this case, I believe your suggestion of also adding 1 to the numerator is valid. As you mentioned, once 1 has been added to the lower term, a word that appears in every document in the set D makes the ratio inside the log function less than 1, so the log is negative. Adding 1 to |D| as well keeps the ratio at 1 or above, so the result of the log function is always greater than or equal to 0.

I would also like to add that the purpose of the inverse document frequency is to give a higher weight to terms that are rare in a document collection. Adding 1 to both the numerator and the denominator keeps that property: the fewer documents a term appears in, the larger its idf.
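
As a quick check, assuming 1 is added to both the numerator and the denominator as proposed in post #1, write N = |D| and n = |{d [itex]\in[/itex] D : t [itex]\in[/itex] d}|, so that 0 ≤ n ≤ N. N and n are shorthand introduced here, not notation from the article.

[itex]\mathrm{idf}(t, D) = \log \frac{1 + N}{1 + n} \ge \log \frac{1 + N}{1 + N} = 0[/itex]

For N = 3, the values n = 0, 1, 2, 3 give log 4, log 2, log(4/3) and log 1 = 0, so the weight falls steadily as the term becomes more common and never drops below 0.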

However, I would suggest double-checking with other sources or experts in the field to confirm that this adjustment is commonly used and accepted. It is always important to validate information from multiple sources to ensure accuracy.

Thank you again for bringing this to my attention and for your critical thinking. I hope this helps clarify the issue for you.
 
