Find Experimental Datasets for Python Modeling

AI Thread Summary
The discussion centers around finding experimental datasets for Python projects, particularly for modeling experiments like pendulum motion. The original poster seeks alternatives to Kaggle and mentions difficulties in finding suitable datasets. Suggestions include exploring NOAA data for greenhouse gases and tides, as well as checking Data.gov and UC Irvine's site for diverse datasets. Some participants offer access to raw data from undergraduate experiments in kinematics and mechanics, emphasizing the importance of using real experimental data for a more authentic learning experience. They discuss methods for generating synthetic data with realistic errors to simulate measurement inaccuracies, highlighting the challenges of creating data that accurately reflects real-world conditions. The conversation underscores the value of authentic datasets in scientific training and the complexities involved in simulating realistic experimental scenarios.
Taylor_1989
I am currently looking to improve my Python skills and am looking for some projects to do. One that came to mind was modeling experiments in Python: I'd like to code an experiment, say the period of a pendulum, and then compare the model to some data obtained in a lab. The issue is that I don't have access to a lab, nor am I in education anymore, and I was wondering if anyone knew of any sites like Kaggle that deal with raw experimental data sets?

I did manage to find a couple of leads on Google, which directed me to a site called Pizza and Chili, but that seems to be a bit of a dead loss. The same goes for Kaggle: there are only a few, and even those were not really what I was looking for. If anyone knows of any, could they please post a link?

Thanks in advance.
The NOAA has data available for greenhouse gas concentrations, tides, and lots of other things. I've mentored excellent student projects using both. See the citations in these two papers for some links:

https://arxiv.org/ftp/arxiv/papers/1812/1812.10402.pdf

https://arxiv.org/pdf/1507.01832.pdf

Unfortunately, I don't know of any central repository, though data in various fields is available through various sources.

Our lab has a number of raw data sets from videos of undergrad-type experiments in kinematics and mechanics, like this one: https://www.physicsforums.com/insights/an-accurate-simple-harmonic-oscillator-laboratory/

We don't intend to publish these data sets or all the videos, but we're willing to send them privately, especially if you're willing to analyze the raw videos using something like Tracker. Send me a PM if interested.
Have you checked Data.gov and UC Irvine's site? They both have massive and varied datasets.
You can produce your own random data, since it is just for practice purposes. This way you could also play with the distribution function.
Yes, use the actual ideal solution to generate one test set, then use the same code but insert small errors into the data to simulate measurement inaccuracies and generate a second test set.

The errors should mimic real-world inaccuracies. For example, a distance in meters measured to an accuracy of ±1 cm would mean adding an offset in a range of, say, -1.5 cm < delta_cm < 1.5 cm to each value:

import random
delta_cm = random.random()*3.0 - 1.5

where random.random() returns a random value between 0.0 and 1.0:

https://pythonprogramminglanguage.com/randon-numbers/
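As a sketch of that two-test-set idea, here is one way it could look for the pendulum example, assuming a small-angle pendulum model; the lengths and the ±1.5 cm spread are illustrative:

```python
import math
import random

# Ideal model: small-angle period of a pendulum, T = 2*pi*sqrt(L/g)
def ideal_period(length_m, g=9.81):
    return 2 * math.pi * math.sqrt(length_m / g)

# First test set: the exact model predictions
lengths = [0.2, 0.4, 0.6, 0.8, 1.0]
ideal = [ideal_period(L) for L in lengths]

# Second test set: the same model plus a simulated measurement error.
# random.random() returns a float in [0.0, 1.0), so this gives a
# uniform offset between -1.5 cm and +1.5 cm on each length.
noisy = []
for L in lengths:
    delta_cm = random.random() * 3.0 - 1.5
    noisy.append(ideal_period(L + delta_cm / 100.0))
```

Comparing the two lists then plays the role of comparing a model against lab data.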
fresh_42 said:
You can produce your own random data as it is just for practice purposes. ...
jedishrfu said:
Yes, use the actual ideal solution to generate one test set and then use the same code but insert small errors to the data to simulate measurement inaccuracies ...

I've mentored a couple of student projects using the approach of generating data with a predictive model plus normally distributed random noise. Compared with using real experimental data, the student's learning process was somewhat suboptimal, so I always worked out how the student could also find and use real experimental data. Even with added noise, computer-generated data does not provide a fully authentic learning experience, since real experimental data often has imperfections other than the experimental uncertainties on the dependent variable.

Some of these can be simulated with additional effort. For example, jitter or error can be added to the independent variable using the same method used to generate a normally distributed, appropriately scaled error for the dependent variable. But even this approach still assumes a data set in which the values of the independent variable are equally spaced, or nearly equally spaced, over an interval. Another approach is to generate random numbers for the independent variable in a given interval representing the anticipated measurement range of a proposed experiment. The point is that real experiments always have uncertainties in both variables, and often the independent variable is not controlled as well as it is measured.
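A minimal sketch of both ideas, jitter on the independent variable and independent values drawn at random over the measurement range, assuming Gaussian noise and a hypothetical linear model (the model, range, and sigmas below are all illustrative, not from any real experiment):

```python
import random

random.seed(42)  # reproducible example

n_points = 20
x_lo, x_hi = 0.0, 10.0          # anticipated measurement range
sigma_x, sigma_y = 0.05, 0.2    # assumed uncertainties (illustrative)

def model(x):
    # Hypothetical underlying law; replace with your own model
    return 2.5 * x + 1.0

data = []
for _ in range(n_points):
    # Independent variable drawn at random over the interval,
    # not equally spaced -- closer to how real runs often look.
    x_true = random.uniform(x_lo, x_hi)
    # Normally distributed error on BOTH variables
    x_meas = x_true + random.gauss(0.0, sigma_x)
    y_meas = model(x_true) + random.gauss(0.0, sigma_y)
    data.append((x_meas, y_meas))
```

The fitting code then only ever sees (x_meas, y_meas), just as an analyst only ever sees the recorded values.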

The learning process when we train aspiring scientists can and should include handling a wide variety of imperfections in experimental data, because that is what experimentalists tend to provide.
I agree with you @Dr. Courtney. It can be very, very difficult to generate realistic-looking data.

One time I needed to create a dummy database of customer transaction data, such as when they clicked on a link and when they bought something, and furthermore to make it appear as multiple customers shopping in real time.

After every attempt, some pattern appeared in the data that could be traced back to the generating program, prompting us to try again. The dummy data tested the system we developed, but it was real customer activity that found the race-condition bugs we were looking for.

In any event, it was a fun coding experience using AWK, weighted arrays plus random indices, and SQL to load the database tables. A weighted array is an n-element array where values are repeated in proportion to their desired frequency, e.g.

1111112222222222222233334555666

so that randomly selecting an element will most often return a 2, since it is the most common value in the array.
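The same weighted-array trick can be sketched in Python (the thread's language, rather than the AWK of the original); the weights below just mirror the digit string above:

```python
import random

random.seed(1)

# Weighted array: each value repeated in proportion to its desired
# frequency, mirroring the digit string 1111112222222222222233334555666
weighted = [1]*6 + [2]*14 + [3]*4 + [4]*1 + [5]*3 + [6]*3

# Picking random elements returns 2 most often, since 2 is
# the most common value in the array.
samples = [random.choice(weighted) for _ in range(10_000)]
counts = {v: samples.count(v) for v in set(samples)}
most_common = max(counts, key=counts.get)
```

In newer Python versions, random.choices with its weights parameter achieves the same effect without building the repeated array.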
WWGD said:
Have you checked Data.gov and UC Irvine's site? They both have massive and varied datasets.
I did check the Data.gov site but didn't really find anything I was specifically looking for; the UC Irvine database, however, is very interesting. Thanks for the link.