ElasticSearch document types removed, why?

  • Thread starter shivajikobardan
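In summary, a field name can have only one data type within an index, so different document types in the same index cannot declare the same field with different data types. Allowing multiple document types per index would therefore cause problems.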
  • #1
shivajikobardan
Homework Statement
Why did we remove multiple document types within an index in ElasticSearch?
Relevant Equations
None
The answer is this:
Because we can't declare a field with different data types in different document types within the same index.

Say there's an index called "college".
Then there are document types called "student", "teacher", "administration" and "staff".
What problem would occur if we allow this?

Books and documentation say that if a field called "date_of_join" is given the "text" data type in "student", then we can't give "date_of_join" the "date" data type in "staff".

They say that this is due to the way Lucene works:

"This is because of the way Lucene maintains the field types in an index. As Lucene manages fields on an index level, there is no flexibility to declare two fields of different data types in the same index."

But this is not clear without an example (of how Lucene stores an index). Can you clarify this?
I know that Lucene stores inverted indexes, but I'm still not clear.
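
As a rough illustration of the quoted passage, here is a toy model in Python. It is a sketch under a stated assumption, not Lucene's or Elasticsearch's actual code: the class `ToyIndex` and its method `put_mapping` are hypothetical names, and the only idea modelled is that the table of field types belongs to the index as a whole rather than to each document type.

```python
class ToyIndex:
    """Toy model: one field-name -> type table shared by the whole index."""

    def __init__(self, name):
        self.name = name
        # Shared by every document type in the index (the assumption being modelled).
        self.field_types = {}

    def put_mapping(self, doc_type, properties):
        # Validate every field first so a rejected mapping changes nothing.
        for field, field_type in properties.items():
            existing = self.field_types.get(field)
            if existing is not None and existing != field_type:
                raise ValueError(
                    f"field '{field}' is already '{existing}' in index "
                    f"'{self.name}'; cannot map it as '{field_type}' "
                    f"for type '{doc_type}'"
                )
        self.field_types.update(properties)


college = ToyIndex("college")
college.put_mapping("student", {"date_of_join": "text"})

try:
    college.put_mapping("staff", {"date_of_join": "date"})
    conflict = None
except ValueError as err:
    conflict = str(err)

print(conflict)
```

Because "student" and "staff" write into the same table, the second declaration of `date_of_join` has nowhere separate to live, which is the conflict the book describes.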
 
  • #2
shivajikobardan said:
Homework Statement: Why did we remove multiple document types within an index in ElasticSearch?

The answer is this:
Because we can't declare a field with different data types in different document types within the same index.
  • One index can contain multiple document types, no problem.
  • Different document types can have fields with different names with different types, no problem.
  • But if different documents have fields with the same name they must be of the same type: this is obvious if you think about it (an index is essentially an ordering, you can't put things in both alphabetical order and date order at the same time).
Rather than remove one or both of the student and staff documents, you could simply change one of the field names, e.g. to date_of_join_text. Or there is an even better solution using pre-processing: can you think what this is?
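
To make the "alphabetical order versus date order" point concrete, here is a small Python sketch. The three join dates are made-up values in day/month/year form, chosen only for illustration; the point is that the same values sort differently depending on whether they are compared as text or as dates.

```python
from datetime import date

# Hypothetical join dates, written as text in day/month/year form.
as_text = ["02/02/2022", "11/11/2011", "31/01/2022"]


def to_date(value):
    """Interpret a day/month/year string as a calendar date."""
    day, month, year = (int(part) for part in value.split("/"))
    return date(year, month, day)


text_order = sorted(as_text)               # character-by-character comparison
date_order = sorted(as_text, key=to_date)  # chronological comparison

print(text_order)  # ['02/02/2022', '11/11/2011', '31/01/2022']
print(date_order)  # ['11/11/2011', '31/01/2022', '02/02/2022']
```

A single ordered structure for `date_of_join` can follow only one of these two orders, so the field has to commit to one type.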
 
  • #3
pbuk said:
  • One index can contain multiple document types, no problem.
  • Different document types can have fields with different names with different types, no problem.
  • But if different documents have fields with the same name they must be of the same type: this is obvious if you think about it (an index is essentially an ordering, you can't put things in both alphabetical order and date order at the same time).
I genuinely don't see a problem.
Say:
date_of_join of student is "2022/2/2" (this is text format)
date_of_join of staff is "2011-11-11" (this is date format, just assume)
What's the problem?
Here's the inverted index:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11.

I don't see any problem. What is it? (Maybe a problem could occur when parsing, since the code might only accept dates in yyyy-mm-dd format.) Other than that, I see no problem of the kind the text claims.

pbuk said:
Rather than remove one or both of the student and staff documents, you could simply change one of the field names, e.g. to date_of_join_text. Or there is an even better solution using pre-processing: can you think what this is?
I can't think of other preprocessing ideas.
 
  • #4
shivajikobardan said:
Here's the inverted index:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11.
That's a forward index, not an inverted index.
Here's the forward index after a few more documents:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11; doc3,02/02/22; doc4,1/2/22; doc5,31/1/2022; doc6,1/31/2022; doc7,2022-02-31; doc8,Last Tuesday; doc9,Not provided...

How do you think the following query is going to handle that?
Code:
GET /_search
{
  "query": {
    "range": {
      "date_of_join": {
        "gte": "now-1y/M"
      }
    }
  }
}
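
One way to see what a date-based range query is up against is to try reading the values above as dates. The Python sketch below assumes, purely for illustration, that only the yyyy-MM-dd form counts as a date; the formats a real date field accepts depend on its mapping. The helper name `parse_strict` is hypothetical.

```python
from datetime import datetime

# The values from the forward index above.
values = {
    "doc1": "2022/2/2", "doc2": "2011-11-11", "doc3": "02/02/22",
    "doc4": "1/2/22", "doc5": "31/1/2022", "doc6": "1/31/2022",
    "doc7": "2022-02-31", "doc8": "Last Tuesday", "doc9": "Not provided",
}


def parse_strict(value):
    """Return a date if the value is a valid yyyy-MM-dd date, else None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None


parsed = {doc: parse_strict(value) for doc, value in values.items()}
usable = [doc for doc, d in parsed.items() if d is not None]
rejected = [doc for doc, d in parsed.items() if d is None]

print(usable)    # ['doc2']
print(rejected)  # every other document, including the impossible 2022-02-31
```

Under that assumption only doc2 can take part in a "joined within the last year" comparison; the rest are just strings with no position on a timeline.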
 
  • #5
shivajikobardan said:
I can't think of other preprocessing ideas.
One idea could be to map date_of_join to two fields: date_of_join_date: date and date_of_join_text: string
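
A minimal sketch of that pre-processing step in Python, run on each document before it is indexed. The function name `split_date_of_join` and the list of accepted formats are assumptions for illustration; only the two target field names come from the suggestion above.

```python
from datetime import datetime

# Assumed input formats; extend to match whatever the real data contains.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")


def split_date_of_join(doc):
    """Replace date_of_join with a raw text field and a normalized date field."""
    raw = doc.get("date_of_join")
    out = {key: value for key, value in doc.items() if key != "date_of_join"}
    out["date_of_join_text"] = raw   # always kept exactly as supplied
    out["date_of_join_date"] = None  # filled in only if the value parses
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt).date()
        except (ValueError, TypeError):
            continue
        out["date_of_join_date"] = parsed.isoformat()
        break
    return out


student = split_date_of_join({"name": "A", "date_of_join": "2022/2/2"})
staff = split_date_of_join({"name": "B", "date_of_join": "Last Tuesday"})
print(student)
print(staff)
```

Each of the two new fields then has exactly one type across the whole index, so range queries can target date_of_join_date while the original text stays searchable in date_of_join_text.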
 
  • #6
pbuk said:
That's a forward index, not an inverted index.
[Attached image: 1658241108177.png]

A term-document arrangement is called an inverted index, though. What's the term here? I guess it's the value, i.e. 2022-02-02? (They use BKD trees for numeric data types, though.)
 
  • #7
shivajikobardan said:
A term-document arrangement is called an inverted index, though. What's the term here? I guess it's the value, i.e. 2022-02-02?
You wrote this: date_of_join=>doc1,2022/2/2.

Even if you are confused as to what a 'term' is, I think it's pretty clear which is a document, and if we are looking for a term-document index then the document must come second.

I am not sure what is going wrong here, these concepts should not be difficult to grasp. Perhaps you should take a break.

shivajikobardan said:
(They use BKD trees for numeric data types, though.)
There is no numeric type involved here, just date and string. And the implementation details are entirely irrelevant to using Elasticsearch - they could change to a KDB tree and it wouldn't change anything.
 
  • #8
pbuk said:
You wrote this: date_of_join=>doc1,2022/2/2.

Even if you are confused as to what a 'term' is, I think it's pretty clear which is a document, and if we are looking for a term-document index then the document must come second.
"Document" is defined differently in different contexts:
1) a document in Elasticsearch
2) a document as most people understand the word
By term-document I mean a figure like the one I posted above.
Like this:
[Attached image: 1658242922166.png]

So anything with the document ID on the right side would be inverted, because the classic technique in data mining etc. was to use a document-term matrix, which would be sparse here.

date_of_join=>doc1,2022/2/2
Since the document ID is on the right side, it should be an inverted index, shouldn't it?
But according to the above analogy, I'd think

2022/2/2->doc1

should be the inverted index, more accurately.
 
  • #9
shivajikobardan said:
Since the document ID is on the right side, it should be an inverted index, shouldn't it?
We are talking about doc1 -> 2022/2/2 where the document id is clearly on the left side.

shivajikobardan said:
I'd think 2022/2/2->doc should be inverted index more accurately.
Yes, that was my point, which is the exact opposite of what you have been saying.
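
For anyone following along, the two directions can be sketched in a few lines of Python. This is only a toy illustration with plain dictionaries, not how Lucene stores anything; doc3 is a made-up extra document so that one value maps to more than one document.

```python
from collections import defaultdict

# Forward index: document id -> value of date_of_join.
forward = {"doc1": "2022/2/2", "doc2": "2011-11-11", "doc3": "2022/2/2"}


def invert(forward_index):
    """Build value -> list of document ids from document id -> value."""
    inverted = defaultdict(list)
    for doc_id, value in forward_index.items():
        inverted[value].append(doc_id)
    return dict(inverted)


inverted = invert(forward)
print(inverted)  # {'2022/2/2': ['doc1', 'doc3'], '2011-11-11': ['doc2']}
```

The forward index answers "what is this document's value?", while the inverted index answers "which documents have this value?", which is the lookup a search needs.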
 

1. What are document types in ElasticSearch and why were they removed?

Document types in Elasticsearch were a way to group different kinds of documents within one index. However, fields with the same name in different types of the same index are backed by the same Lucene field, so they had to share one mapping, which made types a source of confusion. Indices created in version 6.0 or later can contain only a single type, types were deprecated in 7.0, and they were removed entirely in 8.0.

2. How will the removal of document types impact my current ElasticSearch setup?

If your indices currently contain multiple document types, you will need to make some changes, because indices created in version 6.0 or later can contain only one type. This may involve reorganizing your data into separate indices or using a different method of categorizing your documents.

3. What alternatives are available for organizing data without document types?

There are a few alternatives to using document types in ElasticSearch. One option is to use separate indices for different types of data. Another option is to use a field within the document to categorize it, such as a "type" field. Additionally, you can use parent-child relationships or nested objects to organize your data.
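
The first two alternatives can be sketched as simple document transformations in Python. The function names, the `college-student` naming scheme and the field name `type` are illustrative assumptions, not an Elasticsearch API; each function only returns the index name and document body that would then be sent for indexing.

```python
def to_separate_index(index, doc_type, doc):
    """Alternative 1: one index per former document type."""
    return f"{index}-{doc_type}", dict(doc)


def to_type_field(index, doc_type, doc):
    """Alternative 2: a single index with an explicit field naming the type."""
    return index, {**doc, "type": doc_type}


doc = {"name": "A", "date_of_join": "2011-11-11"}

print(to_separate_index("college", "staff", doc))
# ('college-staff', {'name': 'A', 'date_of_join': '2011-11-11'})
print(to_type_field("college", "staff", doc))
# ('college', {'name': 'A', 'date_of_join': '2011-11-11', 'type': 'staff'})
```

With separate indices, each index can map date_of_join however it likes. With a single index and a type field, date_of_join still has to have the same data type for every kind of document.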

4. Will the removal of document types affect the performance of ElasticSearch?

The removal of document types is not expected to have a significant impact on the performance of ElasticSearch. In fact, it may improve performance as it simplifies the indexing process and reduces the complexity of queries.

5. Are there any potential drawbacks to the removal of document types?

Some users may find that the removal of document types makes it more difficult to organize and query their data. Additionally, if you are upgrading from an older version of ElasticSearch, you will need to make some changes to your setup. However, overall, the removal of document types is expected to have a positive impact on the usability and performance of ElasticSearch.
