Text file with all English words and their part of speech

  • Thread starter: Superposed_Cat
  • Tags: English, File, Text

Discussion Overview

The discussion revolves around the request for a text file containing all English words along with their parts of speech, relevant to natural language processing (NLP). Participants explore various resources and considerations regarding the availability and utility of such a file.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant expresses a need for a comprehensive text file of English words with part of speech tags, noting they had it previously but cannot locate it.
  • Another participant questions the practicality of having a file with all English words, suggesting that a more realistic list would contain only 20,000 to 30,000 words, as most speakers use a small fraction of the language.
  • A suggestion is made to use a specific dictionary file available on GitHub, which includes pronunciation.
  • A link to WordNet is provided, highlighting it as a large lexical database that groups words into synsets and is often used in NLP applications.
  • Another participant mentions the Brown Corpus as a well-known corpus that includes parts of speech, while also noting that there is no consensus among computational linguists regarding parts of speech categorization.
  • A link to the NLTK data sources is shared, indicating that it includes WordNet and the Brown Corpus among other resources useful for NLP.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the necessity or practicality of a complete list of English words with parts of speech. Multiple viewpoints regarding the size and utility of such a list are presented, indicating ongoing debate.

Contextual Notes

There are limitations regarding the definitions of what constitutes "all English words" and the varying opinions on the importance of parts of speech data for different NLP tasks.

Superposed_Cat
Hey all, I've been wanting to get into NLP (natural language processing), but I need a text file listing all English words (not the definitions) with a tag indicating each word's part of speech. I know it exists because I had it on my old laptop, but I can't seem to find it again. Any help appreciated.
 
Superposed_Cat said:
Hey all, I've been wanting to get into NLP (natural language processing), but I need a text file listing all English words (not the definitions) with a tag indicating each word's part of speech. I know it exists because I had it on my old laptop, but I can't seem to find it again. Any help appreciated.
ALL the words in English? That's going to be one hell of a file, and mostly useless. Of the 1,000,000+ words in English (depending on who you believe), an average speaker has a vocabulary of about 6,000 to 8,000 words, and a highly educated one has under 20,000, so even highly educated English speakers use less than 2% of the words in the language (and may have "receptive" knowledge of perhaps another 1%). I suspect that your list probably had 20,000 to 30,000 words, not "all" the words in English.
 
I won't be able to help you find your file, but if you want a dictionary file, https://github.com/TheBerkin/Rantionary/blob/master/Prepositions.dic is one. It includes pronunciation as well.
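Whatever file you end up with, loading it usually comes down to the same thing: mapping each word to the set of tags it can carry. Here is a minimal sketch, assuming a hypothetical tab-separated "word, tag" layout (this is not the actual format of the file linked above):

```python
# Minimal sketch: parse a plain-text lexicon of "word<TAB>POS" lines into a
# lookup table. The sample data and tag names below are assumptions for
# illustration, not the layout of any particular dictionary file.
from collections import defaultdict

SAMPLE = """\
run\tnoun
run\tverb
quickly\tadverb
blue\tadjective
"""

def load_lexicon(text):
    """Map each word to the set of part-of-speech tags listed for it."""
    lexicon = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, pos = line.split("\t")
        lexicon[word].add(pos)
    return dict(lexicon)

lexicon = load_lexicon(SAMPLE)
print(lexicon["run"])  # a word can carry more than one tag
```

The set-valued lookup matters: many English words ("run", "set", "light") belong to several parts of speech, so a flat word-to-tag dict would silently drop entries.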
 
http://wordnet.princeton.edu/
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

These guys are often used as corpora for natural language, and their database is downloadable (free). Python NLTK uses this, as do a lot of other NLP libraries.
 
You might want to search for the 'Brown Corpus', one of the earliest and best-known corpora annotated with parts of speech. I don't think any two groups of computational linguists agree on the parts of speech, and depending on what you're doing, you may not even need part-of-speech data at all.
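The disagreement over tagsets is concrete: the Brown Corpus uses a fine-grained tagset of roughly 80+ tags, while the Universal tagset collapses them into about a dozen coarse categories. A self-contained sketch (the mapping below is a small hand-picked excerpt for illustration, not the full conversion table):

```python
# A few Brown Corpus tags and their Universal-tagset equivalents.
# Hand-picked excerpt; the real Brown tagset has roughly 80+ tags.
BROWN_TO_UNIVERSAL = {
    "NN":  "NOUN",  # singular common noun
    "NNS": "NOUN",  # plural common noun
    "VB":  "VERB",  # verb, base form
    "VBD": "VERB",  # verb, past tense
    "JJ":  "ADJ",   # adjective
    "RB":  "ADV",   # adverb
}

brown_tags = ["NN", "VBD", "JJ", "NNS"]
universal = [BROWN_TO_UNIVERSAL[t] for t in brown_tags]
print(universal)  # four fine-grained tags collapse to two coarse categories
```

Which granularity you want depends on the task: a parser may need the fine distinctions, while a keyword extractor often only cares whether something is a noun.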
 
http://www.nltk.org/nltk_data/

That's the complete list of data sources used by the Python Natural Language Toolkit (NLTK). WordNet and the Brown Corpus are in there, along with others. It's quite a good library.
 
