What are some other dataset formats commonly used for large data?

  • Thread starter fog37
  • Tags
    File
In summary, there are various dataset formats for different types of data: tabular and text-based formats such as CSV, TSV, XLS, JSON, and HTML, and relational databases, which are queried with SQL (SQL itself is a language, not a format). For unstructured data, folders can also serve as a dataset format. Other examples include structured text-based formats like XML, property files such as Windows and Java properties, and various structured binary files. HDF5 is commonly used for large binary datasets.
  • #1
fog37
TL;DR Summary
various dataset formats
Hello,

I am familiar with some of the popular dataset formats. For example, there are tabular dataset formats like CSV, TSV, XLS (Excel), JSON, HTML, etc.
For relational datasets, an example is SQL. I guess regular folders work for unstructured data like emails and images.

Are there any other dataset formats that should be mentioned in the context of large data?

Thanks!
 
  • #2
SQL is a language, not a dataset format.
A directory (folder) is a relational database and acts very much like a phone book, mapping names (the key) to a number. A folder might be implemented in any number of internal methods, not all of which have names for their formats.

So for instance, a folder could be implemented in CSV format, simply mapping a filename to a number, like:
"file1",25
"myPasswords",71

This would be very inefficient since any attempt to access the file would require a reading of the entire list up to the line that matches. A paper phone book optimizes this search by putting the keys in alphabetical order, which makes it difficult to insert a new entry, and allows efficient search only on one key.
An efficient format allows fast lookup, searches on multiple keys, and efficient insertions and deletions.
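
As a rough illustration of the point above, here is a minimal Python sketch contrasting a linear scan over CSV-style directory entries with a lookup through an index built once. The two entries come from the example; the function names, the extra "notes.txt" entry, and the choice of a dict as the index are assumptions made for illustration, not how any particular file system works.

```python
import csv
import io

# Directory contents stored as CSV text, as in the example above.
DIRECTORY_CSV = '"file1",25\n"myPasswords",71\n'

def lookup_linear(csv_text, name):
    """Scan the CSV row by row until the name matches (O(n) per lookup)."""
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0] == name:
            return int(row[1])
    return None

def build_index(csv_text):
    """Parse once into a dict so later lookups avoid rescanning the list."""
    return {row[0]: int(row[1]) for row in csv.reader(io.StringIO(csv_text)) if row}

index = build_index(DIRECTORY_CSV)
print(lookup_linear(DIRECTORY_CSV, "myPasswords"))  # 71
print(index["file1"])                               # 25

# Hypothetical new entry: inserting into the dict needs no re-sorting.
index["notes.txt"] = 102
```

A dict still only searches on one key; searching on multiple keys would need one index per key, which is roughly what database engines maintain.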
 
  • #3
There are numerous dataset formats:
- structured text-based ones like html, xml, json, markdown and many other variants
- property files like windows properties, java properties
- general simple fixed record binary files
- general variable sized record binary files
- structured database files like isam.
- memory dump files...
- serialized data files
- numerous structured binary files like zip, tar, hdf5, netcdf, lib, jar files...
- and the list continues after an interminably long break...
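
To make the "fixed record binary file" item in the list above a bit more concrete, here is a small Python sketch using the standard `struct` module. The record layout (a 4-byte integer id followed by an 8-byte float) and the function names are hypothetical; the point is only that equal-sized records allow jumping directly to record N.

```python
import io
import struct

# Hypothetical record layout: little-endian 4-byte int id + 8-byte float value.
RECORD = struct.Struct("<id")

def write_records(stream, records):
    """Append each (id, value) pair as one fixed-size binary record."""
    for rec_id, value in records:
        stream.write(RECORD.pack(rec_id, value))

def read_record(stream, index):
    """Seek straight to record `index`; earlier records are never read."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))

buf = io.BytesIO()  # stands in for a file opened in binary mode
write_records(buf, [(1, 0.5), (2, 1.5), (3, 2.5)])
print(read_record(buf, 2))  # (3, 2.5)
```

Variable-sized records lose this direct seek and usually need a separate offset table or length prefixes.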
 
  • #4
hdf5 is often used for large binary datasets.
 

What is a file format for a dataset?

A file format for a dataset refers to the structure and organization of data within a file. It determines how the data is stored and can impact the accessibility and usability of the dataset.

Why is it important to choose the right file format for a dataset?

The right file format for a dataset is important because it can affect the accuracy, completeness, and integrity of the data. It can also impact the ability to share and analyze the data effectively.

What are some common file formats for datasets?

Some common file formats for datasets include CSV (Comma Separated Values), JSON (JavaScript Object Notation), XML (Extensible Markup Language), and Excel spreadsheets.

How do I determine which file format is best for my dataset?

The best file format for a dataset depends on the type and structure of the data, as well as the intended use of the dataset. Consider factors such as data size, compatibility with software and tools, and the need for data transformation when deciding on a file format.

Can I convert a dataset from one file format to another?

Yes, it is possible to convert a dataset from one file format to another using various software or online tools. However, the conversion process may result in data loss or discrepancies, so it is important to carefully consider the implications before making any conversions.
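
One kind of discrepancy can be sketched in a few lines of Python with the standard `json` and `csv` modules. The sample record below is made up for illustration. JSON distinguishes numbers, booleans, and null, while CSV stores only text, so a round trip through CSV flattens those types.

```python
import csv
import io
import json

# Hypothetical record with typed values (int, float, bool, null).
records = json.loads('[{"id": 1, "score": 9.5, "valid": true, "note": null}]')

# JSON -> CSV: every value is written out as text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)

# CSV -> back to Python dicts: everything comes back as a string.
round_tripped = list(csv.DictReader(io.StringIO(out.getvalue())))

print(records[0])        # typed values: 1, 9.5, True, None
print(round_tripped[0])  # strings only: '1', '9.5', 'True', ''
```

Recovering the original types after such a conversion requires a schema or explicit casting on the reading side.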
