Looking for large dataset of non image-centric physics data

  • Thread starter: galliaproject
  • Tags: Data, Physics
AI Thread Summary
A user is seeking large, non-image-centric physics datasets for documentation related to the Gallia library, specifically looking for examples similar in size and complexity to the ICGC data for skin cancer. They express frustration with existing resources that do not provide direct links to datasets. Suggestions from other users include the Sloan Digital Sky Survey (SDSS), which features data on over 3 million astronomical objects in multiple formats, and the Large Hadron Collider (LHC) data, which contains particle collision information and is accessible through the CERN Open Data Portal. Both datasets meet the criteria of being large and manageable in terms of row size for processing with Gallia.
galliaproject

Hi all,

I'm looking for a good example of a large dataset of non image-centric physics data (e.g. astronomy, particles, ...) so I can add an example to this section of my documentation (formal announcement for the Gallia library: see Scala users forum).

I looked around, for instance on the Awesome Public Datasets page, but in most cases it doesn't link directly to the data, and I've gotten lost going down rabbit holes too many times already. I think I'm just not familiar enough with the domain. For reference, here's what a great counterpart in bioinformatics would be for what I need: ICGC data for skin cancer (it adds up to ~40GB of data once uncompressed).

It'd be nice if the dataset were similarly large-ish (despite having no images) in terms of "rows", as in: not fitting in the memory of a typical consumer-grade computer. It could be in pretty much any format among json/tsv/csv/avro/parquet and the like (see Gallia's input section), and it doesn't have to be all in one file either. It can't, however, have millions or billions of columns: a single "row" has to fit in memory at this time, as Gallia is not a particularly column-focused data processing tool.
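
To make those two constraints concrete (many rows, but each row small enough to hold in memory), here is a minimal sketch of how a candidate file could be screened. It is plain Python rather than Gallia, it only covers delimited text (csv/tsv), and the `profile_rows` helper and the sample column names are made up for illustration.

```python
import csv
import io
from typing import Iterable, NamedTuple


class RowProfile(NamedTuple):
    rows: int           # number of data rows (header excluded)
    max_columns: int    # widest row seen, counted in fields
    max_row_chars: int  # longest data row seen, as the sum of its field lengths


def profile_rows(lines: Iterable[str], delimiter: str = ",") -> RowProfile:
    """Stream a delimited text source one row at a time and summarise its shape.

    Only the current row is held in memory, so the source itself can be far
    larger than available RAM.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None:  # empty input
        return RowProfile(0, 0, 0)

    rows = 0
    max_columns = len(header)
    max_row_chars = 0
    for row in reader:
        rows += 1
        max_columns = max(max_columns, len(row))
        max_row_chars = max(max_row_chars, sum(len(field) for field in row))
    return RowProfile(rows, max_columns, max_row_chars)


# Tiny in-memory stand-in for a real file; the column names are hypothetical.
sample = io.StringIO("object_id\tra\tdec\n1\t10.5\t-3.2\n2\t11.0\t4.75\n")
profile = profile_rows(sample, delimiter="\t")
print(profile)
```

For a real file, the same function can be fed an open file handle, e.g. `profile_rows(open(path, newline=""), delimiter="\t")`. A suitable dataset would show a very large `rows` count alongside modest `max_columns` and `max_row_chars` values.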

Any pointers would be greatly appreciated!

Thanks,

Anthony
 

Hello Anthony,

I often use large datasets for my research and I have come across a few that may be of interest to you for your project. One dataset that comes to mind is the Sloan Digital Sky Survey (SDSS) which contains information on over 3 million astronomical objects. It is available in various formats, including SQL and FITS, and can be accessed through their website or through the Virtual Observatory. Another dataset that may be useful is the Large Hadron Collider (LHC) data, which contains information on particle collisions from experiments at CERN. This data is available in various formats, including ROOT and HDF5, and can be accessed through the CERN Open Data Portal.

I hope this helps with your search for a suitable dataset. Good luck with your project!
 