Please give me an example of how any indexing works in big data search

In summary, the conversation centers on learning about indexing and searching, specifically in the context of Lucene. The original poster is having difficulty finding resources and is looking for feedback and guidance. Other members point to a taxonomy of big data indexes, showing it is not a simple concept, and provide a relevant video and slide deck, along with an explanation of result manipulation using a search for 'special relativity' as an example. The poster asks for an example of such manipulation and mentions another document that may provide clarification.
  • #1
shivajikobardan
Homework Statement
indexing and searching big data, the full Lucene indexing process, distributed searching with Elasticsearch
Relevant Equations
none
[Mentor Note -- PF thread and MHB threads merged together below due to MHB forum merger with PF]

I have to learn this in the context of Lucene, but first I want to learn how indexing works in general. Something like this:

I am not finding any Google Books results or PDFs to learn about these topics. I basically need to learn the basics of indexing and searching, the full Lucene indexing process in detail, Elasticsearch, etc. I haven't googled Elasticsearch yet, but I am not finding much information on the first two. I believe I am not hitting the right Google query. Any feedback would be really helpful, as we don't have an official textbook here and I haven't seen this topic in many general big data books either.
 
Last edited by a moderator:
  • #3
There is a taxonomy of big data indexes, so it is not as simple as you seem to have assumed:

https://d1wqtxts1xzle7.cloudfront.net/61504256/A_survey_on_Indexing_Techniques_for_Big_20191213-17789-1fk1xfk-with-cover-page-v2.pdf?Expires=1657429301&Signature=A~hBFUftKoRMyTR~5Ss6meg4wuT-V63Y3CMyAxN3xaA73eSM6LRK8SiwP2vptGYCocqG2gyP7NGzTixaJofLGf8eKLo01nruK7-9TAsT27iKjY~APa0bJeZM68IRFTi8URlYZ7FZpFqywW9FMZwxQnGpct37CWyAEpiKTGXlkRlLBg8tT2sPy1BnAtb1ZKt~sXeEBgidRlkZzKNQB6DhXuKr9vcnXa0nuaOCIDZQoG0zSo204n4nMRt33WTYQjYWfWnEnyMLZHUBry1on~dRNl-XkFU2M2skFRQ6fapEvVp23m2DMrdeeFThfZbAycs9Ep1HH1s~vUAV4A1FJIpudA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA

The key might break in this link so try google for the same result:

"big data index taxonomy"
 
  • #4
jim mcnamara said:
There is a taxonomy of big data indexes, so it is not as simple as you seem to have assumed:

https://d1wqtxts1xzle7.cloudfront.net/... [long signed link from post #3, truncated]

The key might break in this link so try google for the same result:

"big data index taxonomy"
Bad link. Here's one that works:
https://www.researchgate.net/publication/273082158_A_survey_on_Indexing_Techniques_for_Big_Data_Taxonomy_and_Performance_Evaluation
 
  • #5
phinds said:
Bad link. Here's one that works:
https://www.researchgate.net/publication/273082158_A_survey_on_Indexing_Techniques_for_Big_Data_Taxonomy_and_Performance_Evaluation
Thanks, but this didn't help. I think what I need is related to information retrieval systems. I'm googling that now.
 
  • #6
jim mcnamara said:
There is a taxonomy of big data indexes, so it is not as simple as you seem to have assumed:

https://d1wqtxts1xzle7.cloudfront.net/... [long signed link from post #3, truncated]

The key might break in this link so try google for the same result:

"big data index taxonomy"
Hmm, it should be relevant, but this query didn't turn up many Google results.
 
  • #7
I am very close to getting this. I have understood indexing (as far as I need it); now I am close to understanding how searching works (not Elasticsearch, just the basics of how inverted index searching works). I am confused about the application of step 3.

Here's a relevant video.



This slide deck also covers the concept, but no example is given. I want one example of manipulation (while I have an intuitive feeling of what might be going on).

https://slidetodoc.com/modern-information-retrieval-chapter-8-indexing-and-searching-3/
 
Last edited:
  • #8
shivajikobardan said:
i want one example of manipulation
Let's say we are searching for 'special relativity'. The retrieval stage may return 500 documents containing the words 'special' and 'relativity'; we manipulate the results to place those where the words are close together near the top of the list.
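As an illustration only (not how any particular engine implements it), that re-ranking step could be sketched in Python; the documents and word lists below are made up:

```python
# Hypothetical proximity re-ranking; documents are just lists of words here.
def min_distance(words, a, b):
    """Smallest gap between any occurrence of a and any occurrence of b
    (assumes both words occur in the document)."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return min(abs(i - j) for i in pos_a for j in pos_b)

# Pretend the retrieval stage already found these documents.
docs = {
    "doc1": "the theory of special relativity was published".split(),
    "doc2": "a special offer on books about relativity".split(),
}

# Manipulation stage: sort so the closest "special"/"relativity" pairing comes first.
ranked = sorted(docs, key=lambda d: min_distance(docs[d], "special", "relativity"))
print(ranked)  # doc1 first: the two words are adjacent there
```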
 
  • #9
pbuk said:
Let's say we are searching for 'special relativity'. The retrieval stage may return 500 documents containing the words 'special' and 'relativity'; we manipulate the results to place those where the words are close together near the top of the list.
Yeah, it's like that. There is another document that makes it even clearer.
 
  • #10
Indexing is an optimization method, as opposed to searching the raw data set directly.

We can use your "Google" search example, but since we don't know everything Google does, let's make our own search engine - say "ShivaFind".
So, as an example, let's imagine we are using our new "ShivaFind" search engine with these search terms: crazy river salad

In the simplest case, ShivaFind could simply read every page on the web looking for any page that contains all three of those words. That would work, but we wouldn't want to have to wait for the results - we're looking for results in seconds, not years.

So ShivaFind will start reading the web long before we try to search for anything - and it will build up some indices:
* The first index will be a list of all the web pages that have been scanned. Whenever we discover another web page, we will add it to this list. From that point on, we will refer to that page by its position in the list. So if the tenth (10th) web page we discovered is https://www.physicsforums.com/threads/please-give-me-an-example-of-how-any-indexing-works-in-big-data-searches.1044569/, then that string of characters will be added to the list, and from then on we will refer to this page by the number 10. In this case, the index is simply acting like a glossary so that we can convert the abbreviation ("10") into the full URL.
* The second index is the one described in your video. It will be a list of lists. Whenever we find a new word, we will add that word to the list and add a list of places we have found that word to the list with it.
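The two indices described above can be sketched as a toy in Python (illustrative only; the names are made up, and a real engine would persist these structures to disk):

```python
# Toy versions of ShivaFind's two indices.
pages = []      # index 1: position in list -> full URL
inverted = {}   # index 2: word -> set of page numbers (the "list of lists")

def add_page(url, text):
    """Scan one page: record its URL, then file each word under its page number."""
    pages.append(url)
    page_no = len(pages) - 1          # from now on the page is just this number
    for word in text.lower().split():
        inverted.setdefault(word, set()).add(page_no)

add_page("https://example.com/a", "crazy river")
add_page("https://example.com/b", "river salad")

print(inverted["river"])  # both pages contain "river"
print(pages[0])           # index 1 converts the number back to the URL
```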

That second index (the list of lists) is our word index. We are going to optimize that list of lists so that when you enter your search terms, we can find the pages very quickly. There are two optimizations we will be applying:

Optimization number 1: We will use a hash index - here's how it will work: Before we start scanning the web and collecting words, we will set up a server with a large RAM storage array divided into 65536 records. Then we will choose a hash function that can take any word and turn it into a number from 0 to 65535. We will then store a small "word index" entry for that word in the record with that number. In our example, let's say the search term "crazy" is encoded in ASCII - so it's 5 bytes. The hash will be some function of those 40 bits - followed by a "modulo 65536" to make sure we stay in the range of 0 to 65535. Our hash of "crazy" might give us the number 2080 - so we add the word "crazy" to record number 2080 along with a disk address where we will store the "crazy" word-page list.

Now, if we need to get information about "crazy", we do a quick hash, read its record, and we have a short list of words that hash to 2080 - those are the only words we need to search through. Each of the words in that list includes the disk address that tells us where to get the word-page list for that word. Let's say the disk address for "crazy" is 123, 4567. If we now go to SSD drive number 123, sector 4567, we will find a list of pages where that word occurs - and those pages are specified using the first index. So the record at 123,4567 will include the number "10", which is the index to this forum page.
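A minimal sketch of that hashing step, assuming a made-up folding hash (any real hash function would produce different bucket numbers than the 2080 used above):

```python
NUM_BUCKETS = 65536  # size of the RAM record array

def bucket(word):
    """Fold a word's ASCII bytes into a record number from 0 to 65535.
    (A made-up hash for illustration; real systems use stronger functions.)"""
    h = 0
    for b in word.encode("ascii"):
        h = (h * 31 + b) % NUM_BUCKETS
    return h

# Each RAM record holds the short list of words sharing that bucket,
# each paired with the disk address of its word-page list.
records = [[] for _ in range(NUM_BUCKETS)]
records[bucket("crazy")].append(("crazy", (123, 4567)))  # drive 123, sector 4567

# Lookup: hash the word, scan its short record, recover the disk address.
addr = dict(records[bucket("crazy")])["crazy"]
print(addr)  # (123, 4567)
```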

Optimization number 2: We are optimizing for the search operation - not the data collection/indexing operation. So when we find a new page and begin adding its words to the word-page lists, we will order them in a way that makes combining lists faster. We don't need to discuss exactly what that means - but that hashing trick we just used with words could give us a boost when applied to the entries in the word-page lists.

So for example, "ShivaFind crazy river salad":
Very fast operations:
- Hash "crazy" to 2080, scan its record and find 123,4567.
- Hash "river" to 12345, scan its record and find 44,5555.
- Hash "salad" to 23456, scan its record and find 99,1111.

Fast operations, but they require going out to your big data server:
- Read from those three disk records scanning for 3-way matches.
- For each match, look up the URL from the first index and report it.
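The 3-way matching at the end boils down to a set intersection over the three word-page lists; here is a sketch with made-up page numbers and URLs:

```python
# Word-page lists read back from the big data server (made-up page numbers).
crazy_pages = {3, 10, 42}
river_pages = {7, 10, 42, 99}
salad_pages = {10, 42, 55}

# 3-way match: pages containing all three words.
matches = crazy_pages & river_pages & salad_pages

# Index 1 converts page numbers back into full URLs (hypothetical entries).
urls = {10: "https://example.com/forum-thread", 42: "https://example.com/recipe"}

for page_no in sorted(matches):
    print(urls[page_no])
```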
 
Last edited:
  • #11
shivajikobardan said:
I have to learn in context of lucene, but firstly, I want to learn the example indexing in general.
Sth like this-:

And I am not getting any google books and pdfs to learn about these topics. I basically need to learn basics of indexing and searching,indexing with lucene full process in detail, elastic search etc. I haven't googled elastic search yet, but I am not finding much information for the first twos. I believe I am not hitting the right google query. Any feedback would be really helpful as we don't have official textbook here and this topic I haven't seen in many general big data books as well.

Here is a write up on Indexed data bases.
Regarding "NoSQL", interpret that as meaning "Not Only SQL" rather than "No SQL", depending upon the database.
https://ils.unc.edu/courses/2018_fall/inls523_004/nosql.pdf
 
  • #12
.Scott said:
Indexing is an optimism method.
I'm not sure what you mean by that, but the rest of the post was a really good summary!

A minor correction: you have repeated "crazy" three times:
.Scott said:
So for example, "ShivaFind crazy river salad":
Very fast operations:
- Hash "crazy" to 2080, scan its record and find 123,4567.
- Hash "crazy" to 12345, scan its record and find 44,5555.
- Hash "crazy" to 23456, scan its record and find 99,1111.
 
  • #13
pbuk said:
I'm not sure what you mean by that, but the rest of the post was a really good summary!

A minor correction: you have repeated "crazy" three times:
I have corrected my spelling of "optimization", fleshed out my point on that, and corrected those "crazy" cut-and-paste errors.

Thanks!
 
  • #14
.Scott said:
In contrast to working with an indexed data set, Indexing is an optimization method. [the full ShivaFind worked example, quoted from post #10 above, is trimmed here]
Thank you for the information.
 
  • #15
Thread closed for Moderation...
 
  • #16
Thread is re-opened. As a reminder, this is a schoolwork question. Please wait for the OP to actually show their work on this. It was originally misplaced in the technical forums, so that may have confused some of the folks who replied. Thank you.
 
  • #17
Update -- From the dates on this thread, it appears that @shivajikobardan originally started this thread at MHB in their technical forums (they did not have the same schoolwork rules as PF does), and when MHB was merged with PF, this thread ended up in our technical forums. It is now in our schoolwork forums.
 

1. How does indexing work in big data search?

Indexing in big data search involves creating a searchable structure or database that contains information about the data being searched. This structure allows for faster and more efficient retrieval of data, as the search engine does not have to scan through every single piece of data to find a match.

2. What is the purpose of indexing in big data search?

The purpose of indexing is to improve the speed and accuracy of data retrieval in big data search. By creating a searchable structure, the search engine can quickly locate and retrieve relevant data, making the search process more efficient and effective.

3. Can you give an example of how indexing works in big data search?

Imagine you have a large database of customer information, including names, addresses, and phone numbers. By creating an index based on last names, the search engine can quickly locate all customers with a specific last name, rather than having to scan through the entire database for a match.
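That last-name index can be sketched as a dictionary mapping each last name to its matching customer records (the customer data below is illustrative):

```python
# Made-up customer records for illustration.
customers = [
    {"name": "Ada Smith", "phone": "555-0100"},
    {"name": "Bo Jones",  "phone": "555-0101"},
    {"name": "Cy Smith",  "phone": "555-0102"},
]

# Build the index once; every lookup afterwards avoids a full scan.
by_last_name = {}
for c in customers:
    last = c["name"].split()[-1]
    by_last_name.setdefault(last, []).append(c)

# One dictionary lookup finds all Smiths without scanning the whole database.
print([c["phone"] for c in by_last_name["Smith"]])  # ['555-0100', '555-0102']
```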

4. How is indexing different from traditional search methods?

In traditional search methods, the search engine must scan through all the data to find a match, which can be time-consuming and inefficient. With indexing, the search engine can quickly access the relevant data through the searchable structure, making the search process much faster and more accurate.
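The difference can be demonstrated directly: a linear scan touches every record on every query, while an index is built once and each query then reads only one bucket (a toy sketch with made-up records):

```python
# 10,000 made-up records; the second field is what we search on.
records = [("id%d" % i, i % 100) for i in range(10_000)]

# Traditional search: scan every record for a match.
scan_hits = [r for r in records if r[1] == 42]

# Indexed search: build the index once, then each query reads one bucket.
index = {}
for r in records:
    index.setdefault(r[1], []).append(r)
index_hits = index.get(42, [])

print(len(scan_hits), len(index_hits))  # same results, far less work per query
```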

5. Are there any downsides to using indexing in big data search?

While indexing can greatly improve the speed and efficiency of data retrieval, it does require additional resources and time to create and maintain the index. Additionally, if the index is not updated regularly, it may not accurately reflect the current data, leading to potential errors in search results.
