What are some other dataset formats commonly used for large data?

  • Thread starter: fog37
  • Tags: File

Discussion Overview

The discussion revolves around various dataset formats commonly used for large data, including structured and unstructured formats. Participants explore different types of formats, their characteristics, and their applications in handling large datasets.

Discussion Character

  • Exploratory, Technical explanation, Debate/contested

Main Points Raised

  • Some participants mention popular dataset formats such as CSV, TSV, Excel (xls), JSON, and HTML for tabular datasets.
  • One participant points out that SQL is a language rather than a dataset format.
  • Another participant describes directories (folders) as analogous to relational databases, suggesting they map keys to values but may not have a standardized format.
  • A participant lists numerous dataset formats, including structured text-based formats like HTML, XML, JSON, and various binary file types.
  • HDF5 is highlighted as a commonly used format for large binary datasets.

Areas of Agreement / Disagreement

Participants express differing views on the classification of SQL and the nature of folders as dataset formats. There is no consensus on a definitive list of dataset formats, and multiple competing views remain regarding the categorization and efficiency of various formats.

Contextual Notes

Some claims about the efficiency of formats and their implementations are not fully resolved, and there may be missing assumptions regarding the definitions of dataset formats and their applications.

fog37
TL;DR: various dataset formats
Hello,

I am familiar with some of the popular dataset formats. For example, there are tabular dataset formats like CSV, TSV, XLS (Excel), JSON, HTML, etc.
For relational datasets, an example is SQL. I guess regular folders work for unstructured data like emails and images.

Are there any other dataset formats that should be mentioned in the context of large data?

Thanks!
 
SQL is a language, not a dataset format.
A directory (folder) is effectively a relational database and acts very much like a phone book, mapping a name (the key) to a number. A folder might be implemented internally in any number of ways, not all of which have named formats.

So, for instance, a folder could be implemented in CSV format, simply mapping a filename to a number, like:
"file1",25
"myPasswords",71

This would be very inefficient, since any attempt to look up a file would require reading the list from the top down to the matching line. A paper phone book optimizes this search by keeping the keys in alphabetical order, but that makes it difficult to insert a new entry and allows efficient search on only one key.
An efficient format allows fast lookup, searches on multiple keys, and efficient insertions and deletions.
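To make the contrast concrete, here is a minimal Python sketch using only the standard library. It compares a linear scan over a CSV listing with an indexed table in SQLite. The sample entries, the table name `entries`, and the helper names `csv_lookup` and `build_index` are all made up for illustration; this is not how any real filesystem stores its directories.

```python
import csv
import io
import sqlite3

# Hypothetical directory listing in the CSV layout from the post: name,number
CSV_TEXT = '"file1",25\n"myPasswords",71\n"notes",102\n'


def csv_lookup(text, name):
    """Linear scan: read rows until the name matches.

    Returns (number, rows_read); number is None when the name is absent.
    """
    rows_read = 0
    for key, number in csv.reader(io.StringIO(text)):
        rows_read += 1
        if key == name:
            return int(number), rows_read
    return None, rows_read


def build_index(text):
    """Load the same listing into an in-memory SQLite table indexed on both columns."""
    db = sqlite3.connect(":memory:")
    # The primary key gives an index on name; the second index covers number.
    db.execute("CREATE TABLE entries (name TEXT PRIMARY KEY, number INTEGER)")
    db.execute("CREATE INDEX by_number ON entries (number)")
    rows = ((key, int(number)) for key, number in csv.reader(io.StringIO(text)))
    db.executemany("INSERT INTO entries VALUES (?, ?)", rows)
    return db


db = build_index(CSV_TEXT)

# Lookups on either key can use an index instead of scanning every row.
by_name = db.execute(
    "SELECT number FROM entries WHERE name = ?", ("notes",)
).fetchone()[0]
by_number = db.execute(
    "SELECT name FROM entries WHERE number = ?", (71,)
).fetchone()[0]

# Insertions and deletions do not require rewriting the whole listing.
db.execute("INSERT INTO entries VALUES (?, ?)", ("aardvark", 7))
db.execute("DELETE FROM entries WHERE name = ?", ("file1",))
remaining = [row[0] for row in db.execute("SELECT name FROM entries ORDER BY name")]
```

With three entries the difference is invisible, but the scan in `csv_lookup` grows with the length of the listing, while the indexed lookups stay cheap as the table grows.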
 
There are numerous dataset formats:
- structured text-based ones like HTML, XML, JSON, Markdown and many other variants
- property files like Windows properties, Java properties
- general simple fixed-size record binary files
- general variable-size record binary files
- structured database files like ISAM
- memory dump files...
- serialized data files
- numerous structured binary files like zip, tar, HDF5, NetCDF, lib, jar files...
- and the list continues after an interminably long break...
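As a rough illustration of the "fixed-size record binary file" idea from the list above, here is a small Python sketch using the standard `struct` module. The record layout (a 16-byte name plus a 32-bit number) and the helper names `write_records` and `read_record` are assumptions chosen for the example, not a standard format.

```python
import io
import struct

# Hypothetical layout: 16-byte name (NUL-padded) + unsigned 32-bit number,
# little-endian with no alignment padding, so every record is 20 bytes.
RECORD = struct.Struct("<16sI")


def write_records(stream, rows):
    """Append each (name, number) pair as one fixed-size record."""
    for name, number in rows:
        # Note: struct silently truncates names longer than 16 bytes.
        stream.write(RECORD.pack(name.encode("utf-8"), number))


def read_record(stream, index):
    """Jump straight to record `index` without reading the ones before it."""
    stream.seek(index * RECORD.size)
    raw_name, number = RECORD.unpack(stream.read(RECORD.size))
    return raw_name.rstrip(b"\x00").decode("utf-8"), number


# An in-memory buffer stands in for a file opened in binary mode.
buffer = io.BytesIO()
write_records(buffer, [("file1", 25), ("myPasswords", 71), ("notes", 102)])
third = read_record(buffer, 2)
```

Because every record has the same size, the offset of record `i` is simply `i * RECORD.size`, which is what makes random access cheap. Variable-size records give up that property and usually need a separate index or length prefixes.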
 
HDF5 is often used for large binary datasets.
 
