Most efficient way to randomly choose a word from a file with a list of words

  • #1
Wrichik Basu
I have a file consisting of a bunch of words, with one word on each line. For example, consider this webpage, or this one, except that instead of being online, the list is in a .txt file stored locally on my PC. My aim is to pick a word at random from this list that starts with a specific given letter. Say I want a random word that starts with "n" or "p". Both upper and lower case are acceptable.

What is the most efficient and fast approach to pick out a word with a given starting letter from this massive list in Python (3.12, if the version matters)? Reading a file takes time, so either I have to read the whole file into a list/set and keep it in memory (which will be expensive in terms of memory), or I can write the entire file to a MySQL/SQLite3 database. I want to use this in a bot that will stay online all the time, so I want response times to be as short as possible. I also want a small memory overhead, since the host I use charges more if I need more memory. In that sense, keeping the words in an indexed database will probably be faster. But I am not sure how to select a word that starts with a particular letter from a database.

As always, any ideas are appreciated.
 
  • #2
One could preprocess the list by sorting the words into distinct files like a-words, b-words, c-words...

The letter determines which word list to load into an array; then use Python's random function to pick an index into the array.
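
For illustration, here is a minimal sketch of that approach in Python; the file names (words.txt for the master list, a-words.txt and so on for the split files) are only placeholders, not the actual layout.

Python:
import random
from collections import defaultdict

# One-time preprocessing: split the master list into per-letter files.
# "words.txt" and the "<letter>-words.txt" names are only illustrative.
def split_by_letter(source="words.txt"):
    buckets = defaultdict(list)
    with open(source, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                buckets[word[0].lower()].append(word)
    for letter, words in buckets.items():
        with open(f"{letter}-words.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(words) + "\n")

# At query time: load only the file for the requested letter and pick at random.
def random_word(letter):
    with open(f"{letter.lower()}-words.txt", encoding="utf-8") as f:
        return random.choice(f.read().split())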
 
  • Like
Likes Wrichik Basu
  • #3
jedishrfu said:
One could preprocess the list by sorting the words into distinct files like a-words, b-words, c-words...

The letter determines which word list to load into an array; then use Python's random function to pick an index into the array.
Interesting; I can probably do that in a database too! One column with the first letter of each word, another column with the word itself. Then proceed as shown in this answer on SO:
SQL:
SELECT words FROM table_name WHERE initial_letter = 'a' ORDER BY RANDOM() LIMIT 1;
 
  • #4
Wrichik Basu said:
What is the most efficient and fast approach to pick out a word with a given starting letter from this massive list in Python (3.12, if the version matters)?
If speed is the primary consideration, load the entire file into memory and construct a Python dictionary from it that maps each letter to a function that takes no arguments and returns a randomly chosen word from the list of words in your file that start with that letter. That will give you the fastest response since, once everything is initialized, all you are doing is a dictionary lookup and a single function call, with no file or database access required.
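
A rough sketch of what that could look like (the file name and the use of functools.partial are my assumptions, not a definitive implementation):

Python:
import random
from collections import defaultdict
from functools import partial

# One-time setup: read the word file (name is illustrative) and group by first letter.
words_by_letter = defaultdict(list)
with open("words.txt", encoding="utf-8") as f:
    for line in f:
        word = line.strip()
        if word:
            words_by_letter[word[0].lower()].append(word)

# Map each letter to a no-argument callable that returns a random word with that initial.
pickers = {letter: partial(random.choice, words)
           for letter, words in words_by_letter.items()}

# A query is then just a dictionary lookup plus one function call:
print(pickers["n"]())
print(pickers["p"]())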

However, you say:

Wrichik Basu said:
a small memory overhead
What exactly counts as "small"? And how does that compare to the memory that would be needed to do it the way I described above? A lot will depend on how large your file of words is, since that will determine how much memory it takes to store all the necessary Python objects.
 
  • #5
PeterDonis said:
What exactly counts as "small"?
Less than 256 MB, to be specific. Your idea is good; I will check how much memory is needed if I do it that way.
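
One way to do that check, as a sketch, is with the standard-library tracemalloc module (the file name is a placeholder):

Python:
import tracemalloc

tracemalloc.start()

# Load the whole word list the way the bot would at startup ("words.txt" is a placeholder).
with open("words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{len(words)} words loaded, peak allocation ~{peak / 2**20:.1f} MiB")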
 
  • #6
Wrichik Basu said:
I want to use this in a bot that will stay online all the time, so I want response times to be as short as possible.
One thing to note here is that if your bot is online and you are talking to it over an Internet connection, network latency will probably dominate your response time--i.e., you most likely won't even be able to see the difference in response time between different implementations (in-memory vs. file access vs. database), unless you use a database implementation where the database file is not local to the bot but has to be accessed over a separate network connection. (For SQLite this wouldn't be an issue since you would just co-locate the SQLite file and the bot's code, but for something like MySQL you might not be able to guarantee that the bot and the MySQL server will be running on the same machine.)

Also, one way to reduce network latency, at least assuming both ends have reliably good Internet connections, is to use a websocket to talk to the bot. Setting up the initial websocket connection will have the same latency as a normal HTTP connection, but after that subsequent queries, as long as the websocket remains active, should be quicker because no handshake is required.
 
Last edited:
  • #7
I have built a list of words, sorted like the examples in the OP. It is only about 3.4 MB. Thanks for the links; I will compare them with mine to see what I am missing.

Preprocess the file to find the position where each first letter's block starts.
Generate a random number and interpolate proportionally into the block for the requested letter.
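
A rough sketch of that idea, assuming a sorted file with one word per line (the file name is a placeholder). Note that seeking to a random byte and taking the next full line slightly favours longer words, as discussed further down the thread.

Python:
import random

FILENAME = "words-sorted.txt"  # sorted word list, one word per line; name is illustrative

# Preprocess once: byte offset at which each initial letter's block starts.
def build_offsets(filename=FILENAME):
    offsets = {}
    pos = 0
    with open(filename, "rb") as f:
        for line in f:
            offsets.setdefault(line[:1].lower().decode(), pos)
            pos += len(line)
    offsets["_end"] = pos  # total file size, marks the end of the last block
    return offsets

# Query: pick a random byte position inside the letter's block and take the next full line.
def random_word(letter, offsets, filename=FILENAME):
    starts = sorted(v for k, v in offsets.items() if k != "_end")
    start = offsets[letter]
    end = next((s for s in starts if s > start), offsets["_end"])
    with open(filename, "rb") as f:
        f.seek(random.randrange(start, end))
        f.readline()  # discard the partial line we landed in
        word = f.readline().decode().strip()
        if not word or not word.lower().startswith(letter):
            f.seek(start)  # we ran off the end of the block; fall back to its first word
            word = f.readline().decode().strip()
    return word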
 
  • Like
Likes Wrichik Basu
  • #8
Some other tricks to consider:
- have an HTML list selector where the user selects the first letter
- download those words to the web page
- do the work in the browser
 
  • Like
Likes Wrichik Basu
  • #9
PeterDonis said:
For SQLite this wouldn't be an issue since you would just co-locate the SQLite file and the bot's code, but for something like MySQL you might not be able to guarantee that the bot and the MySQL server will be running on the same machine.
I am using SQLite so that the database file is local to the bot. My host has an option of a MySQL database (free of charge), but I don't use it because of issues related to backing up the data. With an SQLite file, I can just download the file to create a backup.
PeterDonis said:
Also, one way to reduce network latency, at least assuming both ends have reliably good Internet connections, is to use a websocket to talk to the bot. Setting up the initial websocket connection will have the same latency as a normal HTTP connection, but after that subsequent queries, as long as the websocket remains active, should be quicker because no handshake is required.
It's a Discord bot, and I am not really sure if Discord allows one to configure how the connection will be made. Will have to search that a bit.
 
  • #10
This seems trivial, as a list of words is something very small, database-wise, with only 25 000 items (FreeBSD list) or even 10 000 items (MIT list). I downloaded the largest list onto my computer into a file named words-freebsd.txt, to which I added the word "word" as the first line so that it serves as the column name (the file size is 210 kB). Then I ran sqlite3 words.txt to create a new SQLite database (I chose the .txt extension to be able to upload it on PF). In SQLite, I executed this simple code:

SQL:
.mode csv
.import words-freebsd.txt word

Voilà! I ended up with a 385 kB database containing the table word with one column named word. I went further by creating an index (I'm not even sure how much faster it makes a query, but still):

SQL:
CREATE UNIQUE INDEX word_index ON word (word);

Now my database is at 778 kB and I can run a simple query like this one to get what you want (also possible without the index):

SQL:
SELECT word FROM word WHERE word LIKE 'a%' ORDER BY RANDOM() LIMIT 1;

It is lightning fast, and so small that I can even attach the database to this post! I'm pretty sure your 256 MB of RAM will be sufficient. :wink:
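
For completeness, here is a sketch of running that query from Python with the standard-library sqlite3 module, using the database file and table built above. SQLite's LIKE is case-insensitive for ASCII by default, which matches the requirement that both cases are acceptable.

Python:
import sqlite3

# The database file created in the walkthrough above (the .txt extension is just a quirk).
con = sqlite3.connect("words.txt")

def random_word(initial):
    row = con.execute(
        "SELECT word FROM word WHERE word LIKE ? ORDER BY RANDOM() LIMIT 1",
        (initial + "%",),
    ).fetchone()
    return row[0] if row else None

print(random_word("a"))
print(random_word("N"))  # LIKE is case-insensitive for ASCII, so this also works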

EDIT: Apparently, PF loads the .txt file when I attach it to the post but doesn't make it available.
 
  • Like
Likes Wrichik Basu
  • #11
SQL of any form is unnecessary here; @jedishrfu's / @Baluncore's solution is a good one.

If your word lists are truly enormous, so that the whole word array does not fit in memory, preprocess the list so that each word is padded to equal length; then you only need to store the number of entries for each initial letter, or alternatively use a separate index file.
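
A minimal sketch of that idea, assuming the source list is already sorted by first letter and every word fits in the chosen record length (the file names and the record length are illustrative):

Python:
import random

RECORD_LEN = 32  # bytes per record, including the newline; must exceed the longest word

# Preprocess once: write fixed-width records and count how many words start with each letter.
# Assumes "words.txt" is already sorted by first letter, so each letter's block is contiguous.
def build_padded(source="words.txt", dest="words-padded.txt"):
    counts = {}
    with open(source, encoding="utf-8") as src, \
         open(dest, "w", encoding="utf-8") as out:
        for line in src:
            word = line.strip()
            if word:
                out.write(word.ljust(RECORD_LEN - 1) + "\n")
                initial = word[0].lower()
                counts[initial] = counts.get(initial, 0) + 1
    return counts  # small enough to keep in memory or dump to a separate index file

# Query: with fixed-width records and per-letter counts, the seek position is exact.
def random_word(letter, counts, dest="words-padded.txt"):
    skip = sum(n for l, n in counts.items() if l < letter)  # records before this block
    index = skip + random.randrange(counts[letter])
    with open(dest, "rb") as f:
        f.seek(index * RECORD_LEN)
        return f.read(RECORD_LEN).decode().strip()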
 
  • Informative
Likes Wrichik Basu
  • #12
To add some anecdotal info to @pbuk's post:

While working on a data mining project some years ago, we would score a SQL star-schema fact table in IBM DB2, but we found that we could double the speed of scoring by dumping the table to a flat file, scoring it there, and then updating the scoring field in each fact-table record.

At the time, we were scoring records using 250+ attributes of each record, which was saved in the database as a star schema, i.e. one central fact table and several ancillary dimension tables. These all had to be collected together to compute a score inside the database engine, which can become quite costly.

We would dump the star schema to a flat file resembling a spreadsheet (i.e. comma-separated values) and then score the flat file. Scores were then reloaded into the database.

Star schemas are great for reporting-type queries but not so much for data mining work.
 
  • Like
  • Informative
Likes pbuk and Wrichik Basu
  • #13
I may have missed it being mentioned already, but for a file or set of files too large to keep in memory, remember that you can seek and read from a file without storing the full file content in memory. E.g. if you want an approximate random line from a large file with one word per line then you could seek to a random position in the file and then take the next (or previous if no next) full line. This should work even if the file is unordered. But if you need the first letter to be fixed, it is likely most efficient, as already mentioned, to split the single file into a file for each initial letter; this can also be done without sorting and without storing the full content in memory.
 
  • #14
You would do well by reading:

THE ART OF COMPUTER PROGRAMMING
Donald E. Knuth
Volume 3 / Sorting and Searching
Chapter 6, SEARCHING

ISBN 0-201-03803-X
Addison-Wesley, Reading, Massachusetts

The copyright is 1973, but the series was considered the "Bible" for many years, and is still a useful reference.

The other volumes are:
Vol. 1 FUNDAMENTAL ALGORITHMS
Vol. 2 SEMINUMERICAL ALGORITHMS
And Vol. 4, published much later; I don't recall the title
.
.
.
Just did a Google search:
https://www.google.com/search?hl=en&q=the+art+of+computer+programming
which turned up PDF versions, and Vols. 4a, 4b, 5

Have Fun!

Cheers,
Tom
 
  • #15
jedishrfu said:
Star schemas are great for reporting-type queries but not so much for data mining work.
Yes, the key word in RDBMS is relational: if you don't have relational data (e.g. customers/orders/products/stock levels), then you don't need SQL.
 
  • #16
True. To be clear we were building tools for a data warehouse system where query is king. The scoring was done periodically as data was updated during batch runs.

Whether or not to use a relational database is a design issue. It makes a lot of sense for large amounts of data, or data that is highly interconnected (where star schemas shine) and that you have to query a lot, but not so much for small amounts of data, where you can afford to do full scans of the data to get what you want.

For me, 10,000 records was the magic number at which I considered doing things in a database. Also, if I felt the application might grow into something more, then I'd consider a relational database. So much depends on speed, storage, future growth, and optimization that there is no hard and fast rule for what approach is best, especially for systems where the customer may say they are going to use it for X but then you find they are more likely to use it for Y.

In one system I designed, I learned a hard lesson in optimization. We were to keep the data for ten days. Each data record was 200 or so data points, updated every hour, coming in from multiple sources. The most common query was for only 10 data points of interest from each record from a given source.

I designed for that setup, breaking up the data points into small individual chunks so that one record was expanded into 200 rows... This approach minimized network traffic, since only the 10 requested data-point rows over a given time period were sent to the display app. I added indexing tables to speed up the search for data, and things worked great until the ten-day mark.

At ten days, the delete logic kicked in and thrashing began: the database was being updated every hour with 200 new rows while the delete logic searched for and deleted 200 old rows, including the associated index-table entries. The system slowed to a crawl.

I finally realized I had to give up the notion of storing each data point in its own row and instead let the display app pick out the data points it needed from the list of 200 points it got back. Network traffic wasn't appreciably affected, but the database updates and deletes were dramatically sped up.

A most humbling experience, with my project manager (a non-programmer) fretting over my shoulder and my manager trusting in my ability to fix it quickly.
 
Last edited:
  • #17
Filip Larsen said:
E.g. if you want an approximate random line from a large file with one word per line then you could seek to a random position in the file and then take the next (or previous if no next) full line.
But that would bias selections towards longer words.

Unless you make them all the same length by padding them.

In which case you can seek to exactly the right position, which is the method suggested in #2 and #7.
 
  • #18
jedishrfu said:
True. To be clear we were building tools for a data warehouse system where query is king. The scoring was done periodically as data was updated during batch runs.
Oh sorry, my comment
pbuk said:
the key word in RDBMS is relational: if you don't have relational data (e.g. customers/orders/products/stock levels), then you don't need SQL.
was intended for the context of this thread, not your interesting anecdote.

jedishrfu said:
Whether or not to use a relational database is a design issue. It makes a lot of sense for large amounts of data, or data that is highly interconnected (where star schemas shine) and that you have to query a lot
IMHO the real strength of an RDBMS is not when you have to query a lot (if all you want to do is read, then you are usually better off with indexed static files of redundant data); it is when you have data that is added to or updated a lot, so that flat files won't work.
 
  • Like
Likes jedishrfu
  • #19
Wrichik Basu said:
What is the most efficient and fast approach….
You are asking the wrong question here. You don't want the fastest approach; you want one that is fast enough. Decide how many lookups a second you'll be doing and how much time you can spend on each lookup, and then choose the simplest implementation that meets those requirements.
 
  • Like
  • Informative
Likes Wrichik Basu, Filip Larsen, jedishrfu and 1 other person
  • #20
pbuk said:
But that would bias selections towards longer words.
Indeed, that is why I called it "approximate random". But my point was mainly to highlight the option of seeking in a file without having to load all of its content, an option that is often overlooked nowadays.
 
  • Like
Likes jedishrfu
  • #21
Or one could build an ISAM file, which is one step above a random-access file: you build an index to where each first letter starts so you can jump there and begin reading.

https://en.wikipedia.org/wiki/ISAM

ISAM files use a paging scheme where each page holds several records and there are indexing pages that point to these pages.

Each common programming language should have a library of ISAM functions to implement this style of data storage.
 
  • Like
  • Informative
Likes Wrichik Basu and Tom.G
  • #22
jedishrfu said:
Each common programming language should have a library of ISAM functions to implement this style of data storage.
I'm not sure this is true. ISAM is typically part of the "under the hood" implementation of databases like MySQL or Berkeley DB, and the programming language libraries or bindings that exist are for those higher level implementations, not for ISAM itself. There are actual ISAM libraries for C and C++, I believe, but not necessarily for other programming languages.
 
