Comp Sci ElasticSearch document types removed, why?

  • Thread starter: shivajikobardan
AI Thread Summary
The removal of document types in ElasticSearch is primarily due to the inability to declare fields with different data types under the same index for fields sharing the same name. For instance, if "date_of_join" is defined as "text" for one document type, it cannot be defined as "date" for another within the same index. This limitation arises from how Lucene manages field types at the index level, necessitating uniformity for fields with identical names. While different document types can have distinct field names and types, the shared field names must maintain the same data type to avoid conflicts. The discussion emphasizes the importance of understanding Lucene's structure to grasp the rationale behind this design choice.
Homework Statement
Why did we remove multiple document types within an index in ElasticSearch?
Relevant Equations
None
The answer given is:
Because a field cannot be declared with different data types in different document types within the same index.

Say there's an index called "college".
Then there are document types called "student", "teacher", "administration" and "staff".
What problem would occur if we allow this?

Books and documentation say that if a field called "date_of_join" is given the "text" data type in "student", then "date_of_join" cannot be given the "date" data type in "staff".

They say this is due to the way Lucene works:

This is because of the way Lucene maintains the field types in an index. As Lucene manages fields on an index level, there is no flexibility to declare two fields of different data types in the same index

But this is not clear without an example of how Lucene stores the index. Can you clarify this?
I know that Lucene stores inverted indexes, but it is still not clear to me.
 
shivajikobardan said:
Homework Statement: Why did we remove multiple document types within an index in ElasticSearch?

The answer given is:
Because a field cannot be declared with different data types in different document types within the same index.
  • One index can contain multiple document types, no problem.
  • Different document types can have fields with different names with different types, no problem.
  • But if different documents have fields with the same name they must be of the same type: this is obvious if you think about it (an index is essentially an ordering, you can't put things in both alphabetical order and date order at the same time).
Rather than remove one or both of the student and staff documents, you could of course simply change one of the field names, e.g. to date_of_join_text. There is an even better solution using pre-processing, though: can you think what it is?
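To make the rule in the bullets above concrete, here is a minimal Python sketch of an index that keeps a single field-name-to-type table for all of its document types. It is only a toy model of the behaviour described in this thread, not Elasticsearch's or Lucene's actual code; `ToyIndex` and `put_mapping` are hypothetical names invented for the illustration.

```python
class ToyIndex:
    """Toy model: one field-name -> type table shared by the whole index."""

    def __init__(self, name):
        self.name = name
        self.field_types = {}  # field name -> declared type, index-wide

    def put_mapping(self, doc_type, properties):
        # Check every field first so that a rejected call changes nothing.
        for field, new_type in properties.items():
            existing = self.field_types.get(field)
            if existing is not None and existing != new_type:
                raise ValueError(
                    f"{doc_type}.{field}: cannot be '{new_type}', "
                    f"already '{existing}' in index '{self.name}'"
                )
        self.field_types.update(properties)


college = ToyIndex("college")
college.put_mapping("student", {"date_of_join": "text", "roll_no": "keyword"})
college.put_mapping("teacher", {"salary": "integer"})  # new field names are fine

try:
    college.put_mapping("staff", {"date_of_join": "date"})
    conflict = None
except ValueError as err:
    conflict = str(err)  # same name, different type: rejected

print(conflict)
```

Because the table is keyed by field name alone, the document type plays no part in the lookup, which is why "student" and "staff" cannot disagree about date_of_join.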
 
pbuk said:
  • One index can contain multiple document types, no problem.
  • Different document types can have fields with different names with different types, no problem.
  • But if different documents have fields with the same name they must be of the same type: this is obvious if you think about it (an index is essentially an ordering, you can't put things in both alphabetical order and date order at the same time).
I genuinely don't see a problem.
Say:
date_of_join of student is "2022/2/2" (this is text format)
date_of_join of staff is "2011-11-11" (this is date format, just assume so)
What's the problem?
Here's the inverted index:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11.

I don't see any problem. What is it? (Maybe a problem could occur when parsing, since the code might only accept dates in yyyy-mm-dd format.) Other than that, I don't see the problem the text is claiming.

pbuk said:
Rather than remove one or both of the student and staff documents, you could of course simply change one of the field names, e.g. to date_of_join_text. There is an even better solution using pre-processing, though: can you think what it is?
I can't think of other preprocessing ideas.
 
shivajikobardan said:
Here's the inverted index:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11.
That's a forward index, not an inverted index.
Here's the forward index after a few more documents:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11; doc3,02/02/22; doc4,1/2/22; doc5,31/1/2022; doc6,1/31/2022; doc7,2022-02-31; doc8,Last Tuesday; doc9,Not provided...

How do you think the following query is going to handle that?
Code:
GET /_search
{
  "query": {
    "range": {
      "date_of_join": {
        "gte": "now-1y/M"
      }
    }
  }
}
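A rough Python sketch of what a date range comparison is up against with those nine values. It assumes a strict yyyy-MM-dd format and uses a fixed cutoff date in place of "now-1y/M"; both are assumptions chosen for illustration and say nothing about how Elasticsearch itself parses or evaluates the query.

```python
from datetime import date, datetime

# The nine values from the forward index above.
values = {
    "doc1": "2022/2/2",
    "doc2": "2011-11-11",
    "doc3": "02/02/22",
    "doc4": "1/2/22",
    "doc5": "31/1/2022",
    "doc6": "1/31/2022",
    "doc7": "2022-02-31",
    "doc8": "Last Tuesday",
    "doc9": "Not provided",
}


def parse_strict(value):
    """Return a date for yyyy-MM-dd input, or None if it does not parse."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None


parsed = {doc: parse_strict(value) for doc, value in values.items()}
usable = {doc: d for doc, d in parsed.items() if d is not None}
rejected = sorted(doc for doc, d in parsed.items() if d is None)

# Fixed stand-in for the "gte" bound (assumption, not the real "now-1y/M").
cutoff = date(2021, 7, 1)
hits = [doc for doc, d in usable.items() if d >= cutoff]

print("usable:", usable)
print("rejected:", rejected)
print("hits:", hits)
```

Only a value that actually is a date can be compared with the bound; everything else has no defined position relative to it, and doc7 shows that even a correctly shaped string can name a day that does not exist.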
 
shivajikobardan said:
I can't think of other preprocessing ideas.
One idea could be to map date_of_join to two fields: date_of_join_date: date and date_of_join_text: string
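A minimal Python sketch of that pre-processing step, run on each document before it is indexed. The function name `split_date_of_join` and the list of accepted formats are assumptions made for the example; the two output field names are the ones suggested above.

```python
from datetime import datetime

# Formats tried in order. This list is an assumption for illustration;
# ambiguous short forms such as "1/2/22" are deliberately not guessed.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y")


def split_date_of_join(doc):
    """Return a copy of doc with date_of_join split into _text and _date fields."""
    raw = doc.get("date_of_join")
    out = {key: value for key, value in doc.items() if key != "date_of_join"}
    out["date_of_join_text"] = raw   # always keep what was entered
    out["date_of_join_date"] = None  # filled in only if parsing succeeds
    if isinstance(raw, str):
        for fmt in CANDIDATE_FORMATS:
            try:
                parsed = datetime.strptime(raw.strip(), fmt)
            except ValueError:
                continue
            out["date_of_join_date"] = parsed.date().isoformat()
            break
    return out


student = split_date_of_join({"name": "A", "date_of_join": "2022/2/2"})
staff = split_date_of_join({"name": "B", "date_of_join": "Last Tuesday"})
print(student)
print(staff)
```

With this split, every document has the same two fields with the same two types, so range queries can target date_of_join_date while the original input stays searchable in date_of_join_text.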
 
pbuk said:
That's a forward index, not an inverted index.
[Attached image: 1658241108177.png]

A term-document arrangement is called an inverted index, though. What's the term here? I guess it's the value, i.e. 2022-02-02, or not? (They use BKD trees for numeric data types, though.)
 
shivajikobardan said:
A term-document arrangement is called an inverted index, though. What's the term here? I guess it's the value, i.e. 2022-02-02, or not?
You wrote this: date_of_join=>doc1,2022/2/2.

Even if you are confused as to what a 'term' is, I think it's pretty clear which is a document, and if we are looking for a term-document index then the document must come second.

I am not sure what is going wrong here, these concepts should not be difficult to grasp. Perhaps you should take a break.

shivajikobardan said:
(They use BKD trees for numeric data types, though.)
There is no numeric type involved here, just date and string. And the implementation details are entirely irrelevant to using Elasticsearch - they could change to a KDB tree and it wouldn't change anything.
 
pbuk said:
You wrote this: date_of_join=>doc1,2022/2/2.

Even if you are confused as to what a 'term' is, I think it's pretty clear which is a document, and if we are looking for a term-document index then the document must come second.
A document is defined differently in different contexts:
1) a document in Elasticsearch
2) a document as most people understand the word
Term-document means a figure like the one I posted above.
Like this:
[Attached image: 1658242922166.png]

So anything with the "document id" on the right side would be inverted, because the classic technique in data mining etc. was to use a document-term matrix, which would be sparse here.

date_of_join=>doc1,2022/2/2
Since the document id is on the right side, it should be an inverted index, shouldn't it?
But according to the above analogy, I'd think

2022/2/2->doc1

would more accurately be the inverted index.
 
shivajikobardan said:
Since the document id is on the right side, it should be an inverted index, shouldn't it?
We are talking about doc1 -> 2022/2/2 where the document id is clearly on the left side.

shivajikobardan said:
I'd think 2022/2/2->doc1 would more accurately be the inverted index.
Yes, that was my point, which is the exact opposite of what you have been saying.
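For reference, the two directions can be written out in a few lines of Python. This is a sketch under simplifying assumptions: each whole field value is treated as a single term (no tokenising or analysis), the `invert` helper is a hypothetical name, and doc3 is an extra made-up document added only so that one term points at more than one document.

```python
from collections import defaultdict

# Forward index: document id -> value of date_of_join.
forward = {
    "doc1": "2022/2/2",
    "doc2": "2011-11-11",
    "doc3": "2022/2/2",  # hypothetical extra document
}


def invert(forward_index):
    """Build term -> list of document ids from document id -> term."""
    inverted = defaultdict(list)
    for doc_id, term in forward_index.items():
        inverted[term].append(doc_id)
    return dict(inverted)


inverted = invert(forward)
print(inverted)
```

In the forward index you look up a document to find its value; in the inverted index you look up a value to find the documents that contain it.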
 
