Find Experimental Datasets for Python Modeling

In summary, the original poster is looking for projects to improve their Python skills, in particular modeling experiments in Python and comparing the models against real lab data. Searches on Kaggle and Google (which pointed to a site called Pizza and Chili) did not turn up the experimental raw data they were after; Data.gov and UC Irvine's repository proved more useful. Respondents also suggest generating your own practice data, for example by combining a predictive model with normally distributed random noise, while noting that synthetic data never carries all the imperfections of real experimental data.
  • #1
Taylor_1989
I am currently looking to improve my Python skills and am looking for some projects to do. One that came to mind was modeling experiments in Python. What I'd like to do is code a model of an experiment, say the period of a pendulum, and then compare the model to data obtained in a lab. The issue is that I don't have access to a lab, nor am I in education anymore, so I was wondering whether anyone knew of any sites like Kaggle that deal with experimental raw data sets.

I did manage to find a couple of results on Google which directed me to a site called Pizza and Chili, but that seems to be a bit of a dead loss. The same goes for Kaggle: there are only a few relevant data sets, and even those were not really what I was looking for. If anyone knows of any suitable sites, could they please post a link?

Thanks in advance.
 
  • #2
NOAA has data available for greenhouse gas concentrations as well as tides and lots of other things. I've mentored excellent student projects for both. See the citations in these two papers for some links:

https://arxiv.org/ftp/arxiv/papers/1812/1812.10402.pdf

https://arxiv.org/pdf/1507.01832.pdf

Unfortunately, I don't know of any central repository, though data in various fields is available through various sources.

Our lab has a number of raw data sets from videos of undergrad-type experiments in kinematics and mechanics. Stuff like this experiment: https://www.physicsforums.com/insights/an-accurate-simple-harmonic-oscillator-laboratory/

We don't intend to publish these data sets or all the videos, but we're willing to send them privately, especially if you're willing to analyze the raw videos using something like Tracker. Send me a PM if interested.
 
  • #3
Have you checked Data.gov and UC Irvine's site? They both have massive and varied datasets.
 
  • #4
You can produce your own random data, since it is just for practice purposes. That way you could also play with the distribution function.
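For example, a minimal sketch of playing with the distribution function, using only the standard library's random module (the ranges and standard deviation below are made-up values for illustration, not anything from the thread):

import random

# Compare "measurement noise" drawn from two different distributions.
uniform_noise = [random.uniform(-1.5, 1.5) for _ in range(10)]  # flat between -1.5 and 1.5
gaussian_noise = [random.gauss(0.0, 0.5) for _ in range(10)]    # mean 0, standard deviation 0.5

print(uniform_noise)
print(gaussian_noise)

Swapping one distribution for the other is a one-line change, which makes it easy to see how the choice of noise model affects a fit.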
 
  • #5
Yes, use the actual ideal solution to generate one test set, and then use the same code but insert small errors into the data to simulate measurement inaccuracies and generate a second test set.

The errors should mimic real-world inaccuracies. For example, a distance measured in meters with an accuracy of +/- 1 cm would mean adding an offset of, say, -1.5 cm < delta_cm < 1.5 cm to each value:

delta_cm = random.random() * 3.0 - 1.5  # uniform offset in [-1.5, 1.5) cm

where random.random() returns a random value between 0.0 and 1.0; see:

https://pythonprogramminglanguage.com/randon-numbers/
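Putting the two suggestions together, a minimal sketch of the two-test-set idea for the pendulum example from post #1 might look like this (the lengths and noise scale are assumptions for illustration only):

import math
import random

g = 9.81                              # gravitational acceleration, m/s^2
lengths = [0.2, 0.4, 0.6, 0.8, 1.0]   # pendulum lengths in meters (illustrative values)

# Test set 1: the ideal solution T = 2*pi*sqrt(L/g)
ideal = [2 * math.pi * math.sqrt(L / g) for L in lengths]

# Test set 2: the same model plus small random errors to mimic measurement inaccuracy
noisy = [T + random.uniform(-0.015, 0.015) for T in ideal]  # roughly +/- 15 ms of timing error

for L, T, T_meas in zip(lengths, ideal, noisy):
    print(f"L = {L:.2f} m   ideal T = {T:.3f} s   'measured' T = {T_meas:.3f} s")

The second set can then be fed to the same fitting or analysis code you would use on real lab data.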
 
  • #6
fresh_42 said:
You can produce your own random data, since it is just for practice purposes. That way you could also play with the distribution function.
jedishrfu said:
Yes, use the actual ideal solution to generate one test set, and then use the same code but insert small errors into the data to simulate measurement inaccuracies and generate a second test set.

I've mentored a couple of student projects using the approach of generating data with a predictive model plus normally distributed random noise. Compared with using real experimental data, the student's learning process was somewhat suboptimal, so I always figured out how the student could also find and use real experimental data. Even with added noise, computer-generated data does not provide a fully authentic learning experience, since real experimental data often has imperfections other than the experimental uncertainties on the dependent variable.

Some of these can be simulated with additional effort. For example, jitter or error can be added to the independent variable as well, using the same method used to generate a normally distributed, appropriately scaled error for the dependent variable. But even this approach still assumes a data set in which the values of the independent variable are equally spaced, or nearly equally spaced, over an interval. Another approach is to generate random numbers for the independent variable in a given interval representing the anticipated measurement range of a proposed experiment. The point is that real experiments always have uncertainties in both variables, and often the independent variable is not controlled as well as it is measured.
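As a rough illustration of those two points (noise on both variables, and randomly placed values of the independent variable), here is a sketch using the same pendulum model as above; the measurement range and noise scales are assumptions, not values from the thread:

import math
import random

g = 9.81  # m/s^2

# Independent variable: lengths drawn at random over the anticipated measurement
# range, rather than on an evenly spaced grid.
true_lengths = [random.uniform(0.2, 1.2) for _ in range(20)]

data = []
for L in true_lengths:
    L_meas = L + random.gauss(0.0, 0.005)                              # jitter on the independent variable (~5 mm)
    T_meas = 2 * math.pi * math.sqrt(L / g) + random.gauss(0.0, 0.02)  # noise on the dependent variable (~20 ms)
    data.append((L_meas, T_meas))

for L_meas, T_meas in sorted(data):
    print(f"{L_meas:.3f} m   {T_meas:.3f} s")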

The learning process when we train aspiring scientists can and should include handling a wide variety of imperfections in experimental data, because that is what experimentalists tend to provide.
 
  • #7
I agree with you, @Dr. Courtney. It can be very, very difficult to generate realistic-looking data.

One time I needed to create a dummy database of customer transaction data: when they clicked on a link, and when they bought stuff. And further, to have it appear as multiple customers shopping in real time.

After every attempt, some pattern appeared in the data that could be traced back to the generating program, prompting us to try again. It exercised the system we developed, but it was real customer activity that found the race-condition bugs we were looking for.

In any event, it was a fun coding experience using AWK, weighted arrays plus random indices, and SQL to load the database table data. A weighted array is an n-element array where the values are repeated, e.g.

1111112222222222222233334555666

so that randomly selecting an element will most often return a 2, since it is the most common value in the array.
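In Python rather than AWK, a minimal sketch of the same weighted-array trick might look like this (the weights simply mirror the digit string above):

import random
from collections import Counter

# Weighted array: each value is repeated as many times as its weight,
# matching the digit string above (six 1s, fourteen 2s, four 3s, one 4, three 5s, three 6s).
weighted = [1]*6 + [2]*14 + [3]*4 + [4]*1 + [5]*3 + [6]*3

# Randomly indexing into the array returns 2 most often, since 2 dominates it.
samples = [random.choice(weighted) for _ in range(1000)]
print(Counter(samples).most_common())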
 
  • #8
WWGD said:
Have you checked Data.gov and UC Irvine's site? They both have massive and varied datasets.
I did check the Data.gov site but didn't really find anything I was specifically looking for; the UC Irvine database, however, is very interesting. Thanks for the link.
 

1. What is the purpose of finding experimental datasets for Python modeling?

The purpose of finding experimental datasets for Python modeling is to have real-world data that can be used to train and test machine learning models. This allows scientists and researchers to analyze and understand complex patterns and relationships in the data, and use these insights to make accurate predictions and decisions.

2. Where can I find experimental datasets for Python modeling?

There are many online sources where you can find experimental datasets for Python modeling, such as data repositories like Kaggle, UCI Machine Learning Repository, and Google Dataset Search. You can also find datasets on various government websites, academic research papers, and data science communities.

3. How do I know if an experimental dataset is suitable for Python modeling?

An experimental dataset is suitable for Python modeling if it is in a structured format, has a sufficient number of data points, and contains relevant features that can be used for the specific modeling task. It is also important to ensure the dataset is reliable and accurately represents the real-world phenomenon being studied.
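As one possible way to do that check in practice, here is a short pandas sketch (the file name "experiment.csv" is a hypothetical placeholder, not a real dataset from this thread):

import pandas as pd

# "experiment.csv" is a hypothetical placeholder file name.
df = pd.read_csv("experiment.csv")

print(df.shape)         # enough rows (data points) and columns (features)?
print(df.dtypes)        # structured, numeric-friendly columns?
print(df.isna().sum())  # how many missing values per column?
print(df.describe())    # do the ranges and spreads look physically plausible?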

4. Can I use any dataset for Python modeling?

No, not all datasets are suitable for Python modeling. Some datasets may not have enough data points or may not be relevant to the modeling task at hand. It is important to carefully evaluate the dataset before using it for modeling to ensure it is appropriate for your specific needs.

5. Are there any limitations to using experimental datasets for Python modeling?

Yes, there can be limitations to using experimental datasets for Python modeling. These datasets may not always accurately represent the real-world phenomenon, and there may be biases or errors in the data that can affect the performance of the models. It is important to carefully analyze and preprocess the data before using it for modeling to mitigate these limitations.
