Text file with all English words and their part of speech

  • Thread starter: Superposed_Cat
  • Tags: English, File, Text

Discussion Overview

The discussion revolves around the request for a text file containing all English words along with their parts of speech, relevant to natural language processing (NLP). Participants explore various resources and considerations regarding the availability and utility of such a file.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant expresses a need for a comprehensive text file of English words with part of speech tags, noting they had it previously but cannot locate it.
  • Another participant questions the practicality of having a file with all English words, suggesting that a more realistic list would contain only 20,000 to 30,000 words, as most speakers use a small fraction of the language.
  • A suggestion is made to use a specific dictionary file available on GitHub, which includes pronunciation.
  • A link to WordNet is provided, highlighting it as a large lexical database that groups words into synsets and is often used in NLP applications.
  • Another participant mentions the Brown Corpus as a well-known corpus that includes parts of speech, while also noting that there is no consensus among computational linguists regarding parts of speech categorization.
  • A link to the NLTK data sources is shared, indicating that it includes WordNet and the Brown Corpus among other resources useful for NLP.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the necessity or practicality of a complete list of English words with parts of speech. Multiple viewpoints regarding the size and utility of such a list are presented, indicating ongoing debate.

Contextual Notes

There are limitations regarding the definitions of what constitutes "all English words" and the varying opinions on the importance of parts of speech data for different NLP tasks.

Superposed_Cat
Hey all, I've been wanting to get into NLP (natural language processing), but I need a text file listing all English words (not the definitions) with a tag indicating each word's part of speech. I know it exists because I had it on my old laptop, but I can't seem to find it again. Any help appreciated.
 
Superposed_Cat said:
Hey all, I've been wanting to get into NLP (natural language processing), but I need a text file listing all English words (not the definitions) with a tag indicating each word's part of speech. I know it exists because I had it on my old laptop, but I can't seem to find it again. Any help appreciated.
ALL the words in English? That's going to be one hell of a file, and mostly useless. Of the 1,000,000+ words in English (depending on who you believe), an average speaker has a vocabulary of about 6,000 to 8,000 words, and a highly educated one has under 20,000, so even highly educated English speakers use less than 2% of the words in the language (and may have "receptive" knowledge of perhaps another 1%). I suspect that your list probably had 20,000 to 30,000 words, not "all" the words in English.
 
I won't be able to help you find your file, but if you want a dictionary file, https://github.com/TheBerkin/Rantionary/blob/master/Prepositions.dic is one. It includes pronunciation as well.
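Whatever file you end up with, loading it usually comes down to the same thing: mapping each word to the set of tags it can carry. Here is a minimal sketch, assuming a hypothetical tab-separated "word, tag" layout (this is not the actual format of the file linked above):

```python
# Minimal sketch: parse a plain-text lexicon of "word<TAB>POS" lines into a
# lookup table. The sample data and tag names below are assumptions for
# illustration, not the layout of any particular dictionary file.
from collections import defaultdict

SAMPLE = """\
run\tnoun
run\tverb
quickly\tadverb
blue\tadjective
"""

def load_lexicon(text):
    """Map each word to the set of part-of-speech tags listed for it."""
    lexicon = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, pos = line.split("\t")
        lexicon[word].add(pos)
    return dict(lexicon)

lexicon = load_lexicon(SAMPLE)
print(lexicon["run"])  # a word can carry more than one tag
```

The set-valued lookup matters: many English words ("run", "set", "light") belong to several parts of speech, so a flat word-to-tag dict would silently drop entries.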
 
http://wordnet.princeton.edu/
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

These guys are often used as corpora for natural language, and their database is downloadable (free). Python NLTK uses this, as do a lot of other NLP libraries.
 
You might want to search for the 'Brown Corpus', one of the earliest and best-known corpora annotated with parts of speech. I don't think any two groups of computational linguists agree on the parts of speech, and depending on what you're doing, you may not even need part-of-speech data at all.
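The disagreement over tagsets is concrete: the Brown Corpus uses a fine-grained tagset of roughly 80+ tags, while the Universal tagset collapses them into about a dozen coarse categories. A self-contained sketch (the mapping below is a small hand-picked excerpt for illustration, not the full conversion table):

```python
# A few Brown Corpus tags and their Universal-tagset equivalents.
# Hand-picked excerpt; the real Brown tagset has roughly 80+ tags.
BROWN_TO_UNIVERSAL = {
    "NN":  "NOUN",  # singular common noun
    "NNS": "NOUN",  # plural common noun
    "VB":  "VERB",  # verb, base form
    "VBD": "VERB",  # verb, past tense
    "JJ":  "ADJ",   # adjective
    "RB":  "ADV",   # adverb
}

brown_tags = ["NN", "VBD", "JJ", "NNS"]
universal = [BROWN_TO_UNIVERSAL[t] for t in brown_tags]
print(universal)  # four fine-grained tags collapse to two coarse categories
```

Which granularity you want depends on the task: a parser may need the fine distinctions, while a keyword extractor often only cares whether something is a noun.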
 
http://www.nltk.org/nltk_data/

That's the complete list of data sources used by the Python Natural Language Toolkit (NLTK). WordNet and the Brown Corpus are in there, along with others. It's quite a good library.
 
