Most efficient way to randomly choose a word from a file with a list of words

  • Context: Python
  • Thread starter: Wrichik Basu
  • Tags: File, Random

Discussion Overview

The discussion revolves around the most efficient method to randomly select a word from a local text file containing a list of words, specifically focusing on words that start with a given letter. Participants explore various approaches in Python, considering factors such as memory usage, speed, and the potential use of databases.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Exploratory

Main Points Raised

  • One participant suggests preprocessing the list by sorting words into distinct files based on their starting letters, allowing for efficient loading into an array for random selection.
  • Another participant proposes using a database approach, indicating that a SQL query could be used to select a random word starting with a specific letter.
  • One viewpoint emphasizes loading the entire file into memory and constructing a Python dictionary for quick lookups, while questioning what constitutes a "small" memory overhead.
  • A participant mentions the importance of network latency when the bot is online, suggesting that this may overshadow differences in implementation speed.
  • Another participant shares their experience with SQLite, noting the ease of creating a local database and the speed of queries, while also discussing the benefits of indexing.
  • Some participants argue against the necessity of SQL, suggesting alternative methods for handling large word lists efficiently.
  • Anecdotal evidence is provided regarding performance improvements in data processing when using flat files instead of databases.
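The in-memory approach raised above (load the whole file once, build a dictionary keyed by first letter, then pick at random) can be sketched as follows. This is a minimal illustration, not code from the thread; the filename and word list are invented for the example:

```python
import random
from collections import defaultdict

def load_words_by_letter(path):
    """Read a one-word-per-line file into a dict keyed by first letter."""
    words = defaultdict(list)
    with open(path) as f:
        for line in f:
            word = line.strip().lower()
            if word:
                words[word[0]].append(word)
    return words

def random_word(words, letter):
    """Pick a uniformly random word starting with the given letter."""
    candidates = words.get(letter.lower())
    if not candidates:
        return None
    return random.choice(candidates)

# Illustrative usage with a throwaway file
with open("wordlist.txt", "w") as f:
    f.write("apple\nquiz\nquark\nbanana\n")

words = load_words_by_letter("wordlist.txt")
print(random_word(words, "q"))  # either "quiz" or "quark"
```

The one-time loading cost is paid at startup; every subsequent lookup is a dictionary access plus a `random.choice`, which is why this approach is attractive when the file comfortably fits in memory.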

Areas of Agreement / Disagreement

Participants express a variety of opinions on the best approach, with no consensus reached. Some advocate for in-memory solutions, while others prefer database methods or preprocessing techniques. The discussion remains unresolved regarding the optimal strategy.
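For comparison, the database route mentioned above can be sketched with Python's built-in `sqlite3` module. The table layout, index name, and word list here are illustrative assumptions, not details from the thread:

```python
import sqlite3

# Build a small local word database (in memory for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("apple",), ("banana",), ("quark",), ("quiz",)])
# An index on the word column can speed up the prefix lookup
conn.execute("CREATE INDEX idx_word ON words(word)")

# Select one random word starting with a given letter
row = conn.execute(
    "SELECT word FROM words WHERE word LIKE ? ORDER BY RANDOM() LIMIT 1",
    ("q%",)
).fetchone()
print(row[0])
```

`ORDER BY RANDOM() LIMIT 1` is the usual SQLite idiom for a random row; for very large tables there are cheaper tricks, which is part of what the thread debates.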

Contextual Notes

Participants mention varying file sizes and the implications for memory usage, but specific memory requirements and performance metrics remain unclear. The discussion also highlights the potential impact of network conditions on response times.

  • #31
Filip Larsen said:
if whatever partial information is needed from a large file can be retrieved without reading and parsing the whole file and storing it memory first, then it likely will be more performant to not load the whole file in memory
Even leaving aside the web bot issue, the OP's requirements, as far as I can see, cannot be met without loading and parsing the whole file in some way, since you have to randomly select a word. If the file is sorted by initial letter (as at least one of the ones linked to in the OP is), you might be able to get away with loading only the portion containing words beginning with the chosen letter, but even then you would have to know in advance what portion that is, i.e., you would have had to load and parse the entire file once in order to work out which portion of the file contains words starting with each letter. You could do that in a pre-processing step and store the results in a second file, I suppose.
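The pre-processing step described in this post could look something like the sketch below: scan a sorted word file once, record the byte offset at which each initial letter's block starts, and save that index to a second file. Filenames and the index format are illustrative assumptions:

```python
import json

def build_letter_index(words_path, index_path):
    """Scan a sorted one-word-per-line file once and record the byte
    offset at which each initial letter's block starts."""
    index = {}
    offset = 0
    with open(words_path, "rb") as f:
        for line in f:
            word = line.strip().decode()
            if word:
                letter = word[0].lower()
                if letter not in index:
                    index[letter] = offset  # first line for this letter
            offset += len(line)
    index["_end"] = offset  # total file size, handy for the last letter
    with open(index_path, "w") as f:
        json.dump(index, f)
    return index

# Illustrative usage with a throwaway sorted file
with open("sorted_words.txt", "w") as f:
    f.write("apple\nant\nbear\nquark\nquiz\n")
idx = build_letter_index("sorted_words.txt", "letters.json")
```

After this one-time pass, a program can consult the small index file instead of re-parsing the whole word list.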
 
  • #32
PeterDonis said:
cannot be met without loading and parsing the whole file
Sure it can. In the Bad Old Days this was done all the time. You get your letter, say Q, the file header tells you where in the file the Q's start, and you start reading from there.

However, just because you can do it this way does not mean you should. The data file is not large; it is small. An unreasonable version is ~10 MB, and a reasonable one is 5-10% of that. One floppy disk (remember those?).

This is not a lot of data, and one should not attack the problem as if it were.
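The seek-and-read scheme described in this post can be sketched as follows, assuming a per-letter byte-range index has already been prepared (the file layout and index values here are invented for the example):

```python
import random

def random_word_in_range(path, start, end):
    """Read only bytes [start, end) of the file and pick a random word
    from the lines found there."""
    with open(path, "rb") as f:
        f.seek(start)              # jump straight to the letter's block
        chunk = f.read(end - start)
    words = [w.decode() for w in chunk.split(b"\n") if w]
    return random.choice(words)

# Illustrative usage: the index maps each letter to (start, end) offsets
with open("sorted_words.txt", "w") as f:
    f.write("apple\nant\nbear\nquark\nquiz\n")
index = {"a": (0, 10), "b": (10, 15), "q": (15, 26)}
start, end = index["q"]
print(random_word_in_range("sorted_words.txt", start, end))
```

Only the chosen letter's block is ever read, which is the point of the scheme, though as the post notes, for a file this small the saving is negligible.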
 
  • #33
Vanadium 50 said:
the file header
If there is one that contains the necessary data. In the examples the OP linked to, there wasn't.
 
  • #34
Vanadium 50 said:
One floppy disk (remember those?).
I do, yes. My first PC only had floppy drives (two of them), and I had things configured to use a RAM disk for frequently used files because loading them from floppy was so slow.
 
  • #35
Fortunately, HDDs are faster than floppies; they can read that much data in a fraction of a second. I have a small dictionary as part of a program (12K words), and the loading time is a small fraction of a second.

If instead of "header" I had written "index", would that be clearer? The OP's dictionary need not be a flat file; it can have a more complex structure, such as an index at the front. This is at least a 50-year-old solution.

Fundamentally, though, this is not a lot of data, and treating it as if it were is unlikely to lead to the optimal solution.
 
  • #36
Vanadium 50 said:
The OP's dictionary need not be a flat file.
Yes, files like the ones the OP linked to could be pre-processed to add an index section at the front.
 