What are some other dataset formats commonly used for large data?

AI Thread Summary
The discussion highlights various dataset formats relevant for handling large data, emphasizing the distinction between structured and unstructured data. Common formats mentioned include CSV, TSV, Excel, JSON, and HTML for tabular datasets, while SQL is noted as a language rather than a dataset format. The conversation also clarifies that folders can function similarly to relational databases, mapping keys to values, although this method can be inefficient for data access. Efficient dataset formats are characterized by their ability to enable fast lookups, searches on multiple keys, and efficient insertions and deletions. Additional formats discussed include structured text-based formats like XML and Markdown, binary files, and specialized formats such as HDF5, which is commonly used for large binary datasets. The conversation underscores the importance of choosing the right format for data management and retrieval efficiency.
fog37
TL;DR Summary
various dataset formats
Hello,

I am familiar with some of the popular dataset formats. For example, there are tabular dataset formats like CSV, TSV, XLS (Excel), JSON, HTML, etc.
For relational datasets, an example is SQL. I guess regular folders work for unstructured data like emails and images.

Are there any other dataset formats that should be mentioned in the context of large data?

Thanks!
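For illustration, here is a minimal sketch (assuming pandas and openpyxl are installed; the table and file names are made up) that writes the same small table to several of the tabular formats mentioned above and reads it back:

```python
# Minimal sketch: the same small table written to and read back from
# a few common tabular formats (assumes pandas and openpyxl are installed).
import pandas as pd

df = pd.DataFrame({"name": ["ada", "bob"], "score": [91, 78]})

df.to_csv("scores.csv", index=False)          # comma-separated text
df.to_json("scores.json", orient="records")   # JSON list of records
df.to_excel("scores.xlsx", index=False)       # Excel workbook (needs openpyxl)

# Any of these can be read back into the same in-memory table:
print(pd.read_csv("scores.csv"))
print(pd.read_json("scores.json", orient="records"))
print(pd.read_excel("scores.xlsx"))
```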
 
SQL is a language, not a dataset format.
A directory (folder) is a relational database and acts very much like a phone book, mapping names (the key) to a number. A folder might be implemented in any number of internal ways, not all of which have names for their formats.

So for instance, a folder could be implemented in CSV format, simply mapping a filename to a number, like:
"file1",25
"myPasswords",71

This would be very inefficient, since any attempt to access a file would require reading the entire list up to the line that matches. A paper phone book optimizes this search by putting the keys in alphabetical order, but that makes it difficult to insert a new entry and allows efficient search on only one key.
An efficient format allows fast lookup, searches on multiple keys, and efficient insertions and deletions.
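To make the difference concrete, here is a minimal Python sketch (using the hypothetical file names and numbers above; SQLite stands in for "an efficient format") contrasting a full scan of the CSV-style list with an indexed lookup:

```python
# Minimal sketch (hypothetical data): contrast a linear scan of a
# CSV-style list with an indexed lookup.
import csv, io, sqlite3

csv_text = '"file1",25\n"myPasswords",71\n'

# 1) CSV as a "phone book": finding a key means reading rows until one matches.
def lookup_csv(key):
    for name, number in csv.reader(io.StringIO(csv_text)):
        if name == key:
            return int(number)
    return None

print(lookup_csv("myPasswords"))  # 71, but only after scanning earlier rows

# 2) An indexed store: SQLite keeps B-tree indexes, so lookups, inserts and
#    deletes do not require touching every entry, and more than one key
#    can be made searchable.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (name TEXT PRIMARY KEY, number INTEGER)")
con.executemany("INSERT INTO files VALUES (?, ?)",
                [("file1", 25), ("myPasswords", 71)])
con.execute("CREATE INDEX idx_number ON files(number)")  # second searchable key
print(con.execute("SELECT number FROM files WHERE name = ?",
                  ("myPasswords",)).fetchone()[0])       # 71, via the index
```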
 
There are numerous dataset formats:
- structured text-based ones like HTML, XML, JSON, Markdown and many other variants
- property files like Windows properties, Java properties
- general simple fixed-record binary files
- general variable-sized record binary files
- structured database files like ISAM
- memory dump files...
- serialized data files (see the short sketch after this list)
- numerous structured binary files like zip, tar, HDF5, NetCDF, lib, jar files...
- and the list continues after an interminably long break...
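As a small illustration of two of these categories, here is a sketch using only the Python standard library (the file names and data are made up): a serialized data file via pickle, and a structured binary container via zip.

```python
# Small sketch (standard library only): a serialized data file (pickle)
# and a structured binary container file (zip).
import pickle, zipfile

record = {"id": 7, "values": [1.0, 2.5, 4.0]}

# Serialized data file: an in-memory object dumped to a binary file.
with open("record.pkl", "wb") as f:
    pickle.dump(record, f)

# Structured binary container: several files packed into one zip archive.
with zipfile.ZipFile("bundle.zip", "w") as z:
    z.write("record.pkl")
    z.writestr("readme.txt", "one pickled record plus this note")

with open("record.pkl", "rb") as f:
    print(pickle.load(f))            # {'id': 7, 'values': [1.0, 2.5, 4.0]}
with zipfile.ZipFile("bundle.zip") as z:
    print(z.namelist())              # ['record.pkl', 'readme.txt']
```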
 
HDF5 is often used for large binary datasets.
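For example, a minimal sketch with h5py (assuming h5py and NumPy are installed; the file and dataset names are made up) that writes a chunked, compressed array and reads back a single slice without loading the whole dataset into memory:

```python
# Minimal HDF5 sketch: write a large array with chunking and compression,
# then read one slice from disk without loading the whole dataset.
import numpy as np
import h5py

data = np.random.rand(10_000, 1_000)

with h5py.File("big.h5", "w") as f:
    f.create_dataset("measurements", data=data,
                     chunks=(1_000, 1_000), compression="gzip")

with h5py.File("big.h5", "r") as f:
    row = f["measurements"][42, :]   # only this slice is read from disk
    print(row.shape)                 # (1000,)
```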
 