What are some other dataset formats commonly used for large data?

AI Thread Summary
The discussion highlights various dataset formats relevant for handling large data, emphasizing the distinction between structured and unstructured data. Common formats mentioned include CSV, TSV, Excel, JSON, and HTML for tabular datasets, while SQL is noted as a language rather than a dataset format. The conversation also clarifies that folders can function similarly to relational databases, mapping keys to values, although this method can be inefficient for data access. Efficient dataset formats are characterized by their ability to enable fast lookups, searches on multiple keys, and efficient insertions and deletions. Additional formats discussed include structured text-based formats like XML and Markdown, binary files, and specialized formats such as HDF5, which is commonly used for large binary datasets. The conversation underscores the importance of choosing the right format for data management and retrieval efficiency.
fog37
TL;DR Summary
various dataset formats
Hello,

I am familiar with some of the popular dataset formats. For example, there are tabular dataset formats like CSV, TSV, XLS (Excel), JSON, HTML, etc.
For relational datasets, an example is SQL. I guess regular folders work for unstructured data like emails and images.

Are there any other dataset formats that should be mentioned in the context of large data?

Thanks!
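For illustration, here is a minimal sketch (assuming pandas and openpyxl are installed; the table and file names are made up) that writes the same small table to several of the tabular formats mentioned above and reads it back:

```python
# Minimal sketch: the same small table written to and read back from
# a few common tabular formats (assumes pandas and openpyxl are installed).
import pandas as pd

df = pd.DataFrame({"name": ["ada", "bob"], "score": [91, 78]})

df.to_csv("scores.csv", index=False)          # comma-separated text
df.to_json("scores.json", orient="records")   # JSON list of records
df.to_excel("scores.xlsx", index=False)       # Excel workbook (needs openpyxl)

# Any of these can be read back into the same in-memory table:
print(pd.read_csv("scores.csv"))
print(pd.read_json("scores.json", orient="records"))
print(pd.read_excel("scores.xlsx"))
```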
 
SQL is a language, not a dataset format.
A directory (folder) is a relational database and acts very much like a phone book, mapping names (the key) to a number. A folder might be implemented in any number of internal ways, not all of which have names for their formats.

So for instance, a folder could be implemented in CSV format, simply mapping a filename to a number, like:
"file1",25
"myPasswords",71

This would be very inefficient, since any attempt to access a file would require reading the entire list up to the line that matches. A paper phone book optimizes this search by putting the keys in alphabetical order, but that makes it difficult to insert a new entry and allows efficient search on only one key.
An efficient format allows fast lookup, searches on multiple keys, and efficient insertions and deletions.
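To make the difference concrete, here is a minimal Python sketch (using the hypothetical file names and numbers above; SQLite stands in for "an efficient format") contrasting a full scan of the CSV-style list with an indexed lookup:

```python
# Minimal sketch (hypothetical data): contrast a linear scan of a
# CSV-style list with an indexed lookup.
import csv, io, sqlite3

csv_text = '"file1",25\n"myPasswords",71\n'

# 1) CSV as a "phone book": finding a key means reading rows until one matches.
def lookup_csv(key):
    for name, number in csv.reader(io.StringIO(csv_text)):
        if name == key:
            return int(number)
    return None

print(lookup_csv("myPasswords"))  # 71, but only after scanning earlier rows

# 2) An indexed store: SQLite keeps B-tree indexes, so lookups, inserts and
#    deletes do not require touching every entry, and more than one key
#    can be made searchable.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (name TEXT PRIMARY KEY, number INTEGER)")
con.executemany("INSERT INTO files VALUES (?, ?)",
                [("file1", 25), ("myPasswords", 71)])
con.execute("CREATE INDEX idx_number ON files(number)")  # second searchable key
print(con.execute("SELECT number FROM files WHERE name = ?",
                  ("myPasswords",)).fetchone()[0])       # 71, via the index
```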
 
There are numerous dataset formats:
- structured text-based ones like HTML, XML, JSON, Markdown and many other variants
- property files like Windows properties, Java properties
- general simple fixed-record binary files
- general variable-sized record binary files
- structured database files like ISAM
- memory dump files...
- serialized data files (see the short sketch after this list)
- numerous structured binary files like zip, tar, HDF5, NetCDF, lib, jar files...
- and the list continues after an interminably long break...
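As a small illustration of two of these categories, here is a sketch using only the Python standard library (the file names and data are made up): a serialized data file via pickle, and a structured binary container via zip.

```python
# Small sketch (standard library only): a serialized data file (pickle)
# and a structured binary container file (zip).
import pickle, zipfile

record = {"id": 7, "values": [1.0, 2.5, 4.0]}

# Serialized data file: an in-memory object dumped to a binary file.
with open("record.pkl", "wb") as f:
    pickle.dump(record, f)

# Structured binary container: several files packed into one zip archive.
with zipfile.ZipFile("bundle.zip", "w") as z:
    z.write("record.pkl")
    z.writestr("readme.txt", "one pickled record plus this note")

with open("record.pkl", "rb") as f:
    print(pickle.load(f))            # {'id': 7, 'values': [1.0, 2.5, 4.0]}
with zipfile.ZipFile("bundle.zip") as z:
    print(z.namelist())              # ['record.pkl', 'readme.txt']
```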
 
HDF5 is often used for large binary datasets.
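For example, a minimal sketch with h5py (assuming h5py and NumPy are installed; the file and dataset names are made up) that writes a chunked, compressed array and reads back a single slice without loading the whole dataset into memory:

```python
# Minimal HDF5 sketch: write a large array with chunking and compression,
# then read one slice from disk without loading the whole dataset.
import numpy as np
import h5py

data = np.random.rand(10_000, 1_000)

with h5py.File("big.h5", "w") as f:
    f.create_dataset("measurements", data=data,
                     chunks=(1_000, 1_000), compression="gzip")

with h5py.File("big.h5", "r") as f:
    row = f["measurements"][42, :]   # only this slice is read from disk
    print(row.shape)                 # (1000,)
```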
 