Statistics and Data Science applied to Physics

StatGuy2000 · Dec 18, 2015

I wasn't sure where to post this, but I figured this would be a topic under General Physics. I am aware that the next generation of observations, ranging from cosmological observations to post-LHC particle physics experiments, will produce overwhelmingly large and complex datasets, far larger than what many physicists are accustomed to working with.

This leads me to believe that there should be collaborative opportunities between physicists and statisticians, applied mathematicians, and computer scientists specializing in machine learning & complex database research, whose very expertise involves the analysis of large, complex datasets.

I was wondering if anyone here at PF is aware of such collaborative research groups. The only group I'm aware of is the astrostatistics research group at Carnegie Mellon, but perhaps there are more people in the know here. Thanks!

FactChecker · Dec 18, 2015

This is the rapidly growing field of "big data". Big data is a field of study on how to analyse, organize, and visualize massive amounts of data. It is a problem in many fields. There are a lot of research efforts to address it. See https://en.wikipedia.org/wiki/Big_data .

Vanadium 50 · Dec 18, 2015

There have been some Kaggle challenges. I think there is quite some skepticism from the HEP community on machine learning. There is the risk of overtraining; selecting the best solution has some statistical uncertainty to it; and you're never really sure if the training is finding events that have the features you want and not features that are accidentally correlated with the features you want. And there have been some culture clashes. HEP physicists have a reputation for being arrogant, but even they are taken aback by some of the data scientists.

What I found most interesting from the last Kaggle challenge wasn't that there were algorithms that performed marginally better than the ones developed by people with physics knowledge rather than software-optimization expertise, but that many of the better algorithms were substantially faster than ours.

StatGuy2000 · Dec 20, 2015

FactChecker said:

This is the rapidly growing field of "big data". Big data is a field of study on how to analyse, organize, and visualize massive amounts of data. It is a problem in many fields. There are a lot of research efforts to address it. See https://en.wikipedia.org/wiki/Big_data .

I am familiar with the field of "big data" and am aware of the related problems in many fields (particularly in many areas of business). The Wikipedia article you cite does list complex physics simulations, but my question was more about specific collaborations between data scientists (whether they be computer scientists, applied mathematicians, or statisticians) and physicists to address these problems (e.g. interdisciplinary groups working on the challenges of big data in physics).

StatGuy2000 · Dec 20, 2015

Vanadium 50 said:

There have been some Kaggle challenges. I think there is quite some skepticism from the HEP community on machine learning. There is the risk of overtraining; selecting the best solution has some statistical uncertainty to it; and you're never really sure if the training is finding events that have the features you want and not features that are accidentally correlated with the features you want. And there have been some culture clashes. HEP physicists have a reputation for being arrogant, but even they are taken aback by some of the data scientists.

What I found most interesting from the last Kaggle challenge wasn't that there were algorithms that performed marginally better than the ones developed by people with physics knowledge rather than software-optimization expertise, but that many of the better algorithms were substantially faster than ours.

I can certainly see that the risk of overfitting is very real with HEP data and needs to be carefully considered by data scientists and physicists alike (choosing an appropriate training set with the necessary features is also an issue).

The cultural clashes between HEP physicists and data scientists are interesting. Setting aside the arrogance for the moment, I wonder if the use of different language between the two scientific groups may also contribute to the "clashes" or barriers in interdisciplinary cooperation.

I'm also interested in learning more about the Kaggle challenges related to HEP, as well as what types of algorithms you've found that were substantially faster. Any links would be greatly appreciated.

Vanadium 50 · Dec 20, 2015

If you Google "Kaggle LHC" you'll get links to the challenges.

Overtraining is a big problem. How do you know you have done it? There is only the one real dataset. If your dataset is "everyone who shops at Macy's", you also only have the one dataset, but if you have odd features that happen to show up in this instance but wouldn't appear if you repeated the experiment, you don't care. They are real for this dataset. But if your code seizes on the fact that one collects Higgs bosons preferentially on Tuesdays, you're training on random fluctuations.
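To make the "Tuesdays" point concrete, here is a toy sketch in Python. It is not from any HEP analysis, and every name in it (`make_events`, `fit_majority_by_weekday`, `accuracy`) is made up for illustration. It assumes each event carries only a weekday and a signal/background label, with the label generated independently of the weekday, so there is nothing real to learn. It also assumes we can draw a fresh independent sample to check against, which is exactly the luxury you don't have with the one real dataset.

```python
import random


def make_events(n, rng):
    """Generate n toy events as (weekday, label) pairs.

    weekday is 0-6 and label is 1 (signal) or 0 (background).
    The label is drawn independently of the weekday, so any
    weekday/label pattern in a sample is a pure fluctuation.
    """
    return [(rng.randrange(7), rng.randrange(2)) for _ in range(n)]


def fit_majority_by_weekday(events):
    """'Train' by memorizing the majority label seen on each weekday."""
    counts = {day: [0, 0] for day in range(7)}
    for day, label in events:
        counts[day][label] += 1
    # Predict signal only where signal strictly outnumbers background.
    return {day: int(c[1] > c[0]) for day, c in counts.items()}


def accuracy(model, events):
    """Fraction of events whose label matches the model's prediction."""
    return sum(model[day] == label for day, label in events) / len(events)


rng = random.Random(0)             # fixed seed so the run is repeatable
train = make_events(70, rng)       # small sample: fluctuations are large
fresh = make_events(100_000, rng)  # stand-in for "repeating the experiment"

model = fit_majority_by_weekday(train)
train_acc = accuracy(model, train)
fresh_acc = accuracy(model, fresh)

print(f"accuracy on the training sample: {train_acc:.3f}")
print(f"accuracy on a fresh sample:      {fresh_acc:.3f}")
```

By construction the training accuracy can never fall below 50%, and on a small sample it will typically sit noticeably above it, while the fresh-sample accuracy stays near 50%. The gap is the part of the "performance" that came from fitting fluctuations.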
