What is a Data Lake? Understanding the Buzzword

  • Thread starter: lomidrevo
  • Tags: Data Lake

Discussion Overview

The discussion revolves around the concept of a "data lake," exploring its definition, implications, and whether it is merely a buzzword. Participants examine various interpretations and applications of the term, including its relation to data processing technologies and marketing language.

Discussion Character

  • Exploratory, Debate/contested, Conceptual clarification

Main Points Raised

  • One participant defines a data lake as a repository of data stored in its raw format, encompassing various types of data including structured, semi-structured, unstructured, and binary data.
  • Another participant notes the ambiguity surrounding the term, suggesting it can refer to different technologies such as ETL processes, distributed file systems like Apache Hadoop HDFS, NoSQL databases with SQL support, or proprietary architectures.
  • Some participants express skepticism about the term, questioning whether it is simply a buzzword that lacks a consistent definition.
  • A later reply introduces a metaphorical perspective, suggesting that 'data lake' serves as a conceptual reservoir for cloud technologies, while critiquing the use of such terms as often more prevalent in marketing than in engineering.

Areas of Agreement / Disagreement

Participants generally express uncertainty regarding the definition of a data lake, with multiple competing views on its meaning and implications. The discussion remains unresolved, with no consensus on whether it is a legitimate concept or merely a buzzword.

Contextual Notes

Limitations include the lack of a clear, universally accepted definition of a data lake, as well as the dependence on varying interpretations by different authors and stakeholders.

lomidrevo
I think the basic idea is quite clear, as defined for example by Wikipedia:
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
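To make the definition above concrete, here is a minimal sketch of the "store everything in its raw format" idea, using a local directory as a stand-in for the lake. It is not taken from Wikipedia or from any particular product; the names `ingest`, `read_records`, and `demo`, the `raw/<source>/` layout, and the choice to interpret files only when they are read are all assumptions made for illustration.

```python
from __future__ import annotations

import csv
import io
import json
import tempfile
from pathlib import Path


def ingest(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte; nothing is parsed or validated on write."""
    target = lake_root / "raw" / source / name  # hypothetical layout
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_records(path: Path) -> list[dict]:
    """Interpret a raw file only at read time, based on its extension."""
    data = path.read_bytes()
    if path.suffix == ".json":
        # Semi-structured: one JSON object per line
        lines = data.decode("utf-8").splitlines()
        return [json.loads(line) for line in lines if line.strip()]
    if path.suffix == ".csv":
        # Structured rows and columns; values stay strings
        return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))
    # Binary or unstructured data: expose metadata only
    return [{"file": path.name, "size_bytes": len(data)}]


def demo() -> dict[str, list[dict]]:
    """Ingest three differently shaped files and read each one back."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        files = [
            ingest(lake, "sensors", "readings.json",
                   b'{"id": 1, "temp": 21.5}\n{"id": 2, "temp": 19.0}\n'),
            ingest(lake, "crm", "customers.csv", b"id,name\n1,Ada\n2,Grace\n"),
            ingest(lake, "camera", "frame.bin", bytes([0, 255, 16, 32])),
        ]
        return {f.name: read_records(f) for f in files}


if __name__ == "__main__":
    for name, records in demo().items():
        print(name, records)
```

The point of the sketch is only that the write path is format-agnostic while each consumer decides how to interpret the bytes; real systems that use the term differ widely in how much more they add on top.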

But when I google more about this "technology", I find quite varied ideas about what counts as a data lake. Some of them:
  • just a synonym for the ETL approach to data processing
  • a distributed file system, like Apache Hadoop HDFS
  • a NoSQL database with additional SQL support, for example MongoDB
  • or some proprietary architecture involving all of that, plus maybe some extra tools for reporting, visualization, and machine learning?

How do you understand the term data lake? Is it just a buzzword?
 
lomidrevo said:
Is it just a buzzword?
Yes. It can mean whatever the author wants it to mean.
 
pbuk said:
Yes. It can mean whatever the author wants it to mean.
That is my current impression, thanks :)
 
Maybe the 'data lake' is the 'reservoir' that engenders and sustains the 'cloud'. I think that such metaphors are used for enablement of non-rigorous semblances of understanding; I have encountered use of such fanciful terms much more by marketers than by engineers.
 
