Find Experimental Datasets for Python Modeling

  • Context: Python
  • Thread starter: Taylor_1989
  • Tags: Experimental
SUMMARY

This discussion focuses on finding experimental datasets for Python modeling, particularly for simulating physical experiments like pendulum motion. Participants recommend various sources, including NOAA for greenhouse gas and tide data, Data.gov, and UC Irvine's datasets. Additionally, they suggest generating synthetic data to practice modeling, emphasizing the importance of incorporating realistic measurement inaccuracies. The conversation highlights the challenges of obtaining authentic experimental data and the need for students to engage with real-world imperfections in datasets.

PREREQUISITES
  • Understanding of Python programming and libraries for data analysis.
  • Familiarity with data generation techniques, including adding noise to datasets.
  • Knowledge of experimental physics concepts, particularly in kinematics and mechanics.
  • Experience with data repositories like NOAA, Data.gov, and UC Irvine's datasets.
NEXT STEPS
  • Explore NOAA's datasets for environmental data relevant to modeling.
  • Investigate Data.gov for a variety of public datasets across different fields.
  • Learn about generating synthetic datasets in Python, including noise addition techniques.
  • Research methods for simulating measurement inaccuracies in experimental data.
USEFUL FOR

Data scientists, educators, and students in physics or engineering looking to enhance their Python modeling skills with real-world experimental datasets.

Taylor_1989
I am currently looking to improve my Python skills and looking for some projects to do. One that came to mind was experimental modeling in Python: I'd like to code an experiment, say the period of a pendulum, and then compare the model to data obtained in the lab. The issue is that I don't have access to a lab, nor am I in education anymore, so I was wondering if anyone knew of any sites like Kaggle that deal with raw experimental data sets.

I did manage to find a couple via Google, which directed me to a site called Pizza and Chili, but it seems to be a bit of a dead loss. The same goes for Kaggle: only a few data sets, and even those were not really what I was looking for. If anyone knows of any, could they please post a link?

Thanks in advance.
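For anyone picking this thread up, the kind of project described above can be sketched in a few lines. This is only an illustration: the "measured" values below are made up, not real lab data.

```python
import math

def pendulum_period(length_m, g=9.81):
    """Small-angle period of a simple pendulum: T = 2*pi*sqrt(L/g)."""
    return 2.0 * math.pi * math.sqrt(length_m / g)

# Hypothetical measured periods for a few pendulum lengths (illustrative only).
measurements = [
    (0.25, 1.01),
    (0.50, 1.43),
    (1.00, 2.01),
]

for length, measured in measurements:
    predicted = pendulum_period(length)
    print(f"L={length:.2f} m  model={predicted:.3f} s  "
          f"measured={measured:.3f} s  residual={measured - predicted:+.3f} s")
```

Swapping the made-up tuples for a real data set is then just a file-loading step.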
 
NOAA has data available for greenhouse gas concentrations as well as tides and many other things. I've mentored excellent student projects with both. See the citations in these two papers for some links:

https://arxiv.org/ftp/arxiv/papers/1812/1812.10402.pdf

https://arxiv.org/pdf/1507.01832.pdf

Unfortunately, I don't know of any central repository, though data in many fields is available through a variety of sources.

Our lab has a number of raw data sets from videos of undergrad-type experiments in kinematics and mechanics, like this experiment: https://www.physicsforums.com/insights/an-accurate-simple-harmonic-oscillator-laboratory/

We don't intend to publish these data sets or all the videos, but we're willing to send them privately, especially if you're willing to analyze the raw videos using something like Tracker. Send me a PM if interested.
 
WWGD
Have you checked Data.gov and UC Irvine's site? They both have massive and varied datasets.
 
fresh_42
You can produce your own random data, since it is just for practice purposes. This way you could also play with the distribution function.
 
jedishrfu
Yes, use the actual ideal solution to generate one test set, then use the same code but insert small errors into the data to simulate measurement inaccuracies and generate a second test set.

The errors should mimic real-world inaccuracies. For example, a distance in meters with an accuracy of ±1 cm would mean adding an error in a range of, say, -1.5 cm < delta_cm < 1.5 cm to each value:

delta_cm = random.random()*3.0 - 1.5

where random.random() returns a random value between 0.0 and 1.0:

https://pythonprogramminglanguage.com/randon-numbers/
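A minimal sketch of this two-test-set idea, using the small-angle pendulum period as the "ideal solution". The lengths and the ±1.5 cm error range are just illustrative values, not a prescription:

```python
import math
import random

def ideal_period(length_m, g=9.81):
    # The "actual ideal solution": small-angle pendulum period.
    return 2.0 * math.pi * math.sqrt(length_m / g)

random.seed(42)  # reproducible noise for the example

lengths_m = [0.2, 0.4, 0.6, 0.8, 1.0]

# Test set 1: exact model values.
ideal_set = [(L, ideal_period(L)) for L in lengths_m]

# Test set 2: same code, but each length gets a measurement error
# uniformly distributed in (-1.5 cm, +1.5 cm), as suggested above.
noisy_set = []
for L in lengths_m:
    delta_cm = random.random() * 3.0 - 1.5   # uniform in (-1.5, 1.5) cm
    L_measured = L + delta_cm / 100.0        # convert cm to m
    noisy_set.append((L_measured, ideal_period(L_measured)))
```

Fitting the model to `noisy_set` and recovering g within the expected uncertainty makes a good first sanity check of the analysis code.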
 
Dr. Courtney
fresh_42 said:
You can produce your own random data, since it is just for practice purposes. This way you could also play with the distribution function.
jedishrfu said:
Yes, use the actual ideal solution to generate one test set, then use the same code but insert small errors into the data to simulate measurement inaccuracies and generate a second test set.

The errors should mimic real-world inaccuracies. For example, a distance in meters with an accuracy of ±1 cm would mean adding an error in a range of, say, -1.5 cm < delta_cm < 1.5 cm to each value:

delta_cm = random.random()*3.0 - 1.5

where random.random() returns a random value between 0.0 and 1.0:

https://pythonprogramminglanguage.com/randon-numbers/

I've mentored a couple of student projects using the approach of generating data with a predictive model plus normally distributed random noise. Compared with using real experimental data, the student's learning process was somewhat suboptimal, so I always figured out how the student could also find and use real experimental data. Even with added noise, computer-generated data does not provide a fully authentic learning experience, since real experimental data often has imperfections other than the experimental uncertainties on the dependent variable.

Some of these can be simulated with additional effort. For example, jitter or error can be added to the independent variable as well, using the same method used to generate a normally distributed, appropriately scaled error for the dependent variable. But even this approach still assumes a data set in which the values of the independent variable are equally spaced, or nearly equally spaced, over an interval. Another approach is to generate random numbers for the independent variable in a given interval representing the anticipated measurement range of a proposed experiment. The point is that real experiments always have uncertainties in both variables, and often the independent variable is not controlled as well as it is measured.
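One way to sketch both of the ideas above: randomly placed independent-variable values, plus normal noise on both variables. The noise levels and measurement range here are assumptions chosen for illustration:

```python
import math
import random

random.seed(0)  # reproducible for the example

def model(L):
    # Predictive model: small-angle pendulum period.
    return 2.0 * math.pi * math.sqrt(L / 9.81)

n_points = 20
L_min, L_max = 0.2, 1.2   # anticipated measurement range (assumed)
sigma_L = 0.005           # jitter on the independent variable, in m (assumed)
sigma_T = 0.02            # noise on the dependent variable, in s (assumed)

data = []
for _ in range(n_points):
    # Independent variable drawn at random in the interval, not evenly spaced.
    L_true = random.uniform(L_min, L_max)
    # Both variables carry normally distributed error.
    L_obs = L_true + random.gauss(0.0, sigma_L)
    T_obs = model(L_true) + random.gauss(0.0, sigma_T)
    data.append((L_obs, T_obs))
```

Because the recorded length `L_obs` differs from the `L_true` that generated the period, a fit to this data set faces errors in both variables, as in a real experiment.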

The learning process when we train aspiring scientists can and should include handling a wide variety of imperfections in experimental data, because that is what experiments tend to provide.
 
I agree with you, @Dr. Courtney. It can be very difficult to generate realistic-looking data.

One time I needed to create a dummy database of customer transaction data: when they clicked on a link and when they bought stuff, and further, to have it appear as multiple customers shopping in real time.

After every attempt, some pattern appeared in the data that could be traced back to the generating program, prompting us to try again. It tested the system we developed, but it was real customer activity that found the race-condition bugs we were looking for.

In any event, it was a fun coding experience using AWK, weighted arrays plus random indices, and SQL to load the database table data. A weighted array is an n-element array in which values are repeated in proportion to their desired frequency, i.e.

1111112222222222222233334555666

so that randomly selecting an element will most often return a 2, since it is the most common value in the array.
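A minimal sketch of that weighted-array trick in Python (the weights below mirror the example string above; building the array with list repetition is just one convenient way to do it):

```python
import random

# Weighted array: values repeated in proportion to their desired frequency,
# matching the example 1111112222222222222233334555666.
weighted = [1]*6 + [2]*14 + [3]*4 + [4]*1 + [5]*3 + [6]*3

random.seed(1)  # reproducible for the example

# Selecting elements by random index returns 2 most often,
# because 2 is the most common value in the array.
samples = [weighted[random.randrange(len(weighted))] for _ in range(10_000)]

counts = {v: samples.count(v) for v in sorted(set(samples))}
print(counts)
```

The same effect is available directly via `random.choices(values, weights=...)` in the standard library, but the repeated-element array makes the idea explicit.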
 
Taylor_1989
WWGD said:
Have you checked Data.gov and UC Irvine's site? They both have massive and varied datasets.
I did check the Data.gov site but didn't really find anything I was looking for specifically; the UC Irvine database is very interesting, though. Thanks for the link.
 
