Discussion Overview
The discussion centers on a request for a text file listing all English words with their parts of speech, for use in natural language processing (NLP). Participants suggest existing resources and debate whether such a file is available or even practical.
Discussion Character
- Exploratory
- Technical explanation
- Debate/contested
Main Points Raised
- One participant asks for a comprehensive text file of English words with part-of-speech tags, noting that they once had such a file but can no longer locate it.
- Another participant questions the practicality of having a file with all English words, suggesting that a more realistic list would contain only 20,000 to 30,000 words, as most speakers use a small fraction of the language.
- A suggestion is made to use a specific dictionary file available on GitHub, which includes pronunciation.
- A link to WordNet is provided, highlighting it as a large lexical database that groups words into synsets and is often used in NLP applications.
- Another participant mentions the Brown Corpus as a well-known corpus tagged with parts of speech, while noting that computational linguists have no consensus on how parts of speech should be categorized.
- A link to the NLTK data sources is shared, indicating that it includes WordNet and the Brown Corpus among other resources useful for NLP.
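The resources above suggest one way to produce the requested file: derive it from a tagged corpus rather than look for a ready-made list. The following is a minimal sketch, not a method proposed in the thread. It assumes the input is an iterable of (word, tag) pairs, and the names build_lexicon, write_lexicon, and max_words are hypothetical. The max_words cutoff reflects the point that a list of the most frequent 20,000 to 30,000 words may be more realistic than "all" words. Dropping non-alphabetic tokens is a simplifying assumption that also discards hyphenated words and contractions.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_words, max_words=None):
    """Map each lowercased word to the sorted list of tags seen with it.

    Words are ordered from most to least frequent; max_words (hypothetical
    parameter) keeps only the top entries, or all of them when None.
    """
    counts = Counter()
    tags = defaultdict(set)
    for word, tag in tagged_words:
        key = word.lower()
        if not key.isalpha():  # assumption: skip punctuation, numbers, etc.
            continue
        counts[key] += 1
        tags[key].add(tag)
    kept = [word for word, _ in counts.most_common(max_words)]
    return {word: sorted(tags[word]) for word in kept}


def write_lexicon(lexicon, path):
    """Write one 'word<TAB>tag1,tag2' line per entry."""
    with open(path, "w", encoding="utf-8") as handle:
        for word, word_tags in lexicon.items():
            handle.write(f"{word}\t{','.join(word_tags)}\n")


# Tiny illustrative sample; the tags are only examples of corpus-style labels.
sample = [
    ("The", "AT"), ("run", "NN"), ("run", "VB"),
    ("the", "AT"), ("the", "AT"), (",", ","), ("dog", "NN"),
]
print(build_lexicon(sample, max_words=2))
# → {'the': ['AT'], 'run': ['NN', 'VB']}
```

With NLTK installed and the Brown Corpus downloaded (for example via nltk.download("brown")), the pairs returned by nltk.corpus.brown.tagged_words() could be passed to build_lexicon in place of the sample. Because a word such as "run" can carry several tags, the sketch stores a list of tags per word rather than a single one.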
Areas of Agreement / Disagreement
Participants do not reach a consensus on the necessity or practicality of a complete list of English words with parts of speech. Multiple viewpoints regarding the size and utility of such a list are presented, indicating ongoing debate.
Contextual Notes
The discussion leaves open what counts as "all English words," and participants hold varying opinions on how important part-of-speech data is for different NLP tasks.