Statistics and Data Science applied to Physics

AI Thread Summary
The discussion highlights the increasing complexity of datasets in physics, particularly with upcoming cosmological and particle physics experiments, necessitating collaboration between physicists and experts in statistics, applied mathematics, and machine learning. There is skepticism within the High Energy Physics (HEP) community regarding machine learning techniques, primarily due to concerns about overfitting and the challenges of selecting appropriate training features. Cultural clashes between HEP physicists and data scientists are noted, stemming from differing terminologies and attitudes. Recent Kaggle challenges have shown that some machine learning algorithms can outperform traditional physics-based methods in speed, although the accuracy remains a concern. Overall, the conversation emphasizes the need for interdisciplinary cooperation to effectively tackle the challenges posed by big data in physics.
StatGuy2000
I wasn't sure where to post this, but I figured this would be a topic under General Physics. I am aware that the next generation of observations, ranging from cosmological observations to post-LHC particle physics experiments, will produce overwhelmingly large and complex datasets, far larger than what many physicists are accustomed to working with.

This leads me to believe that there should be collaborative opportunities between physicists and statisticians, applied mathematicians, and computer scientists specializing in machine learning & complex database research, whose very expertise involves the analysis of large, complex datasets.

I was wondering if anyone here at PF is aware of such collaborative research groups. The only group I'm aware of is the astrostatistics research group at Carnegie Mellon, but perhaps there may be more people in the know here. Thanks!
 
This is the rapidly growing field of "big data". Big data is a field of study on how to analyse, organize, and visualize massive amounts of data. It is a problem in many fields. There are a lot of research efforts to address it. See https://en.wikipedia.org/wiki/Big_data .
 
There have been some Kaggle challenges. I think there is quite some skepticism from the HEP community on machine learning. There is the risk of overtraining; selecting the best solution has some statistical uncertainty to it; and you're never really sure if the training is finding events that have the features you want and not features that are accidentally correlated with the features you want. And there have been some culture clashes. HEP physicists have a reputation for being arrogant, but even they are taken aback by some of the data scientists.

What I found most interesting from the last Kaggle wasn't that there were algorithms that performed marginally better than the ones developed by people with physics knowledge (who are not software optimizers), but that many of the better algorithms were substantially faster than ours.
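To make the point about "selecting the best solution has some statistical uncertainty" concrete, here is a minimal sketch. It is a toy, not anything taken from the Kaggle challenges: it assumes 50 hypothetical models that all have exactly the same true accuracy (0.70) and scores each one on 500 simulated events. The function names and all the numbers are illustrative assumptions.

```python
import random


def observed_accuracy(true_accuracy, n_events, rng):
    """Score a model with a fixed true accuracy on a finite sample of events."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_events))
    return correct / n_events


def best_of_many(n_models=50, true_accuracy=0.70, n_events=500, seed=1):
    """Score n_models equally good (hypothetical) models; return best and mean score."""
    rng = random.Random(seed)
    scores = [
        observed_accuracy(true_accuracy, n_events, rng) for _ in range(n_models)
    ]
    return max(scores), sum(scores) / len(scores)


best, mean = best_of_many()
print(f"mean score {mean:.3f}, best score {best:.3f}")
```

Every model here is equally good by construction, so any gap between the best score and the mean is pure sampling fluctuation. Picking the top entry of a leaderboard carries the same kind of uncertainty, which is why a small margin between first place and a physics-motivated baseline is hard to interpret on its own.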
 
FactChecker said:
This is the rapidly growing field of "big data". Big data is a field of study on how to analyse, organize, and visualize massive amounts of data. It is a problem in many fields. There are a lot of research efforts to address it. See https://en.wikipedia.org/wiki/Big_data .

I am familiar with the field of "big data" and am aware of the related problems in many fields (particularly in many areas of business). The Wikipedia article you cite does list complex physics simulations -- my question was more about specific collaborations between data scientists (whether they be computer scientists, applied mathematicians, or statisticians) and physicists to address these problems (e.g. interdisciplinary groups working on the challenges of big data in physics).
 
Vanadium 50 said:
There have been some Kaggle challenges. I think there is quite some skepticism from the HEP community on machine learning. There is the risk of overtraining; selecting the best solution has some statistical uncertainty to it; and you're never really sure if the training is finding events that have the features you want and not features that are accidentally correlated with the features you want. And there have been some culture clashes. HEP physicists have a reputation for being arrogant, but even they are taken aback by some of the data scientists.

What I found most interesting from the last Kaggle wasn't that there were algorithms that performed marginally better than the ones developed by people with physics knowledge (who are not software optimizers), but that many of the better algorithms were substantially faster than ours.

I can certainly see that the risk of overfitting is very real with respect to HEP data and needs to be carefully considered by data scientists and physicists alike (choosing an appropriate training set with the necessary features is also an issue).

The cultural clashes between HEP physicists and data scientists are interesting -- setting aside the arrogance for the moment, I wonder if the use of different language between the two scientific groups may also contribute to the "clashes" or barriers in interdisciplinary cooperation.

I'm also interested in learning more about the Kaggle challenges related to HEP, as well as what types of algorithms you've found that were substantially faster. Any links would be greatly appreciated.
 
If you Google "Kaggle LHC" you'll get links to the challenges.

Overtraining is a big problem. How do you know you have done it? There is only the one real dataset. If your dataset is "everyone who shops at Macy's", you also only have the one dataset, but if you have odd features that happen to show up in this instance but wouldn't appear if you repeated the experiment, you don't care. They are real for this dataset. But if your code seizes on the fact that one collects Higgs bosons preferentially on Tuesdays, you're training on random fluctuations.
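One common way to see "training on random fluctuations" in practice is to compare performance on the events used for training with performance on events held back from it. The sketch below is a toy under loudly stated assumptions: the features and labels are pure noise, so there is nothing real to learn, and the classifier is a plain 1-nearest-neighbour rule written by hand. The function names and sample sizes are hypothetical, and this is not how any particular HEP analysis does it.

```python
import random


def make_noise_events(n_events, n_features, rng):
    """Events whose features and labels are independent random numbers."""
    features = [
        [rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_events)
    ]
    labels = [rng.randint(0, 1) for _ in range(n_events)]
    return features, labels


def predict_1nn(train_x, train_y, point):
    """Return the label of the training event closest to `point`."""
    nearest = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    return train_y[nearest]


def accuracy(train_x, train_y, eval_x, eval_y):
    """Fraction of evaluation events whose label the 1-NN rule reproduces."""
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(eval_x, eval_y))
    return hits / len(eval_y)


rng = random.Random(0)
train_x, train_y = make_noise_events(300, 5, rng)
held_x, held_y = make_noise_events(300, 5, rng)

train_acc = accuracy(train_x, train_y, train_x, train_y)
held_acc = accuracy(train_x, train_y, held_x, held_y)
print(f"accuracy on training events: {train_acc:.2f}")
print(f"accuracy on held-out events: {held_acc:.2f}")
```

On the training events the rule looks perfect, because each event is its own nearest neighbour. On the held-out events it can only guess, since the labels are random. A large gap of this kind is the usual symptom of overtraining. The toy has a limit that matches the concern above: with only one real dataset, the held-out events have to be carved out of that same dataset, so this check catches memorised fluctuations but does not stand in for repeating the experiment.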
 