Indexing is an optimization method: instead of scanning the raw data for every query, you prepare an indexed version of the data set ahead of time so that searches run quickly.
We can use your "Google" search example, but since we don't know everything Google does, let's make our own search engine - say "ShivaFind".
So, as an example, let's imagine we are using our new "ShivaFind" search engine with these search terms: crazy river salad
In the simplest case, ShivaFind could simply read every page on the web, looking for any page that contains all three of those words. That would work, but we wouldn't want to have to wait for the results - we're looking for results in seconds, not years.
So ShivaFind will start reading the web long before we try to search for anything - and it will build up some indices:
* The first index will be a list of all the web pages that have been scanned. Whenever we discover another web page, we will add it to this list. Then, from that point on, we will refer to that page by its position in that list. So if the tenth (10th) web page we discovered is https://www.physicsforums.com/threads/please-give-me-an-example-of-how-any-indexing-works-in-big-data-searches.1044569/, then that string of characters will be added to the list and from that point on we will refer to this page by the number 10. In this case, the index is simply acting like a glossary, letting us convert the abbreviation ("10") back into the full URL.
* The second index is the one described in your video. It will be a list of lists: whenever we find a new word, we will add that word to the list, together with a list of the places where we have found that word.
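The two indices above can be sketched in a few lines of Python (the names, URLs, and data here are invented for illustration; a real engine would use far more compact on-disk structures):

```python
pages = []        # first index: position in this list -> full URL
word_index = {}   # second index: word -> list of page numbers ("list of lists")

def add_page(url, words):
    """Record a newly scanned page and index every word found on it."""
    pages.append(url)
    page_number = len(pages) - 1   # from now on, refer to the page by this number
    for word in set(words):        # each page appears at most once per word
        word_index.setdefault(word, []).append(page_number)
    return page_number

# Scan two made-up pages:
add_page("https://example.com/a", ["crazy", "river"])
add_page("https://example.com/b", ["crazy", "river", "salad"])
```

After this runs, `pages[0]` converts the abbreviation "0" back into the full URL, and `word_index["crazy"]` lists every page number where "crazy" was seen.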
That second index (the list of lists) is our word index. We are going to optimize that list of lists so that when you enter your search terms, we can find the pages very quickly. There are two optimizations we will be applying:
Optimization number 1: We will use a hash index - here's how it will work. Before we start scanning the web and collecting words, we will set up a server with a large RAM storage array divided into 65536 records. Then we will choose a hash function that can take any word and turn it into a number from 0 to 65535, and we will store a small "word index" entry for each word in the record with that number. In our example, let's say the search term "crazy" is encoded in ASCII - so it's 5 bytes. The hash will be some function of those 40 bits, followed by a "modulo 65536" to make sure we stay in the range 0 to 65535.

Our hash of "crazy" might give us the number 2080 - so we add the word "crazy" to record number 2080, along with a disk address telling us where we will store the "crazy" word-page list. Now, if we need information about "crazy", we do a quick hash, read that record, and we have a short list of words that hash to 2080 - and those are the only words we need to search through. Each word in that list carries the disk address of its word-page list. Let's say the disk address for "crazy" is 123, 4567. If we now go to SSD drive number 123, sector 4567, we will find a list of pages where that word occurs - and those pages are specified using the first index. So the record at 123,4567 will include the number "10", which is the index of this forum page.
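Here is a minimal sketch of that hashing step. The hash function, the bucket contents, and the disk address are all invented stand-ins (the post doesn't prescribe a specific hash):

```python
NUM_BUCKETS = 65536   # 2**16 records in the RAM array

def bucket_for(word):
    """Hash a word's bytes down to a record number in 0..65535."""
    h = 0
    for b in word.encode("ascii"):   # "crazy" -> 5 bytes = 40 bits
        h = (h * 31 + b) & 0xFFFFFFFF
    return h % NUM_BUCKETS           # modulo 65536 keeps us in range

# Each record holds the few words that hash to that number, plus the
# disk address (drive, sector) of each word's word-page list.
records = [[] for _ in range(NUM_BUCKETS)]
records[bucket_for("crazy")].append(("crazy", (123, 4567)))  # made-up address

def disk_address(word):
    """Look up a word: one hash, then a scan of one short record."""
    for w, addr in records[bucket_for(word)]:
        if w == word:
            return addr
    return None
```

The payoff is that `disk_address("crazy")` never scans the whole word list - only the handful of words that happen to share bucket 2080 (or whatever number the hash produces).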
Optimization number 2: We are optimizing for the search operation - not the data collection/indexing operation. So when we find a new page and begin adding its words to the word-page lists, we will order them in a way that makes combining lists faster. We don't need to discuss exactly what that means - but that hashing trick we just used with words could give us a boost when applied to the entries in the word-page lists.
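One common version of that ordering idea is to keep each word-page list sorted by page number, so two lists can be combined in a single pass instead of comparing every entry against every other. This is a sketch of that technique, not necessarily the exact scheme meant above:

```python
def intersect_sorted(a, b):
    """One-pass merge of two sorted word-page lists, keeping common pages."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # page contains both words
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Invented word-page lists, kept sorted by page number as they were built:
crazy_pages = [3, 10, 42, 97]
river_pages = [10, 42, 55]
both = intersect_sorted(crazy_pages, river_pages)   # pages with both words
```

Because both lists are already sorted, the merge touches each entry at most once, which is exactly the kind of work we want to pay for at indexing time rather than at search time.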
So, for example, searching ShivaFind for "crazy river salad":
Very fast operations:
- Hash "crazy" to 2080, scan its record and find 123,4567.
- Hash "river" to 12345, scan its record and find 44,5555.
- Hash "salad" to 23456, scan its record and find 99,1111.
Fast operations, but they require going out to your big-data server:
- Read from those three disk records scanning for 3-way matches.
- For each match, look up the URL from the first index and report it.
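Putting the steps together, the whole query boils down to three record lookups, a 3-way intersection, and a URL lookup per match. A toy end-to-end sketch, with invented word-page lists standing in for the data read back from the disk records:

```python
# Invented word-page lists, as if read back from the three disk records:
word_pages = {
    "crazy": {3, 10, 42},    # read from SSD 123, sector 4567
    "river": {10, 42, 55},   # read from SSD 44, sector 5555
    "salad": {7, 10},        # read from SSD 99, sector 1111
}

# Abridged first index: page number -> URL (entry 10 is this forum page).
first_index = {
    10: "https://www.physicsforums.com/threads/please-give-me-an-example-of-how-any-indexing-works-in-big-data-searches.1044569/",
}

# Three-way match, then convert page numbers back into URLs:
matches = word_pages["crazy"] & word_pages["river"] & word_pages["salad"]
results = [first_index[n] for n in sorted(matches) if n in first_index]
```

Only page 10 contains all three words, so the first index converts "10" back into the full URL and that page is what ShivaFind reports.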