Statistics and Data Science applied to Physics


Discussion Overview

The discussion revolves around the application of statistics and data science to the field of physics, particularly in the context of handling large datasets generated by modern experiments in cosmology and particle physics. Participants explore the potential for collaboration between physicists and experts in data analysis, machine learning, and statistics, as well as the challenges and cultural dynamics involved in such interdisciplinary efforts.

Discussion Character

  • Exploratory
  • Debate/contested
  • Technical explanation

Main Points Raised

  • One participant notes the increasing complexity and size of datasets in physics, suggesting a need for collaboration with statisticians and data scientists.
  • Another participant emphasizes the concept of "big data" and its relevance across various fields, including physics, while seeking specific examples of interdisciplinary collaborations.
  • Concerns are raised about skepticism within the High Energy Physics (HEP) community regarding machine learning, particularly issues related to overtraining and the selection of appropriate training sets.
  • Some participants discuss cultural clashes between HEP physicists and data scientists, attributing these to differences in language and approach, as well as perceived arrogance.
  • Interest is expressed in the performance of algorithms from Kaggle challenges, with observations that some algorithms are faster than traditional physics-based methods.
  • Questions are posed about how to identify overtraining in models when only one real dataset is available, highlighting the complexities of data interpretation in physics.

Areas of Agreement / Disagreement

Participants express a range of views regarding the collaboration between physicists and data scientists, with some acknowledging the potential benefits while others highlight skepticism and cultural barriers. The discussion remains unresolved regarding the best approaches to integrating these fields effectively.

Contextual Notes

Participants mention limitations related to the understanding of overtraining and the challenges of selecting appropriate features for training datasets. There is also an acknowledgment of the unique characteristics of datasets in physics that may complicate standard data analysis practices.

StatGuy2000
I wasn't sure where to post this, but I figured this would be a topic under General Physics. I am aware that the next generation of observations, ranging from cosmological observations to post-LHC particle physics experiments, will produce overwhelmingly large and complex datasets, far larger than what many physicists are accustomed to working with.

This leads me to believe that there should be potential collaborative opportunities between physicists and statisticians, applied mathematicians, and computer scientists specializing in machine learning and complex database research, whose expertise involves the analysis of large, complex datasets.

I was wondering if anyone here at PF is aware of such collaborative research groups. The only group I'm aware of is the astrostatistics research group at Carnegie Mellon, but perhaps there are people more in the know here. Thanks!
 
FactChecker
This is the rapidly growing field of "big data": the study of how to analyse, organize, and visualize massive amounts of data. It is a problem in many fields, and there are many research efforts to address it. See https://en.wikipedia.org/wiki/Big_data.
 
Vanadium 50

There have been some Kaggle challenges. I think there is quite some skepticism from the HEP community on machine learning. There is the risk of overtraining; selecting the best solution has some statistical uncertainty to it; and you're never really sure if the training is finding events that have the features you want and not features that are accidentally correlated with the features you want. And there have been some culture clashes. HEP physicists have a reputation for being arrogant, but even they are taken aback by some of the data scientists.

What I found most interesting from the last Kaggle challenge wasn't that some algorithms performed marginally better than the ones developed by people with physics knowledge (rather than software-optimization expertise), but that many of the better algorithms were substantially faster than ours.
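As a purely illustrative sketch (not anything from the thread or an actual HEP analysis), the train/validation gap that signals overtraining can be demonstrated with a toy NumPy example: a model that memorizes its training data looks perfect on that data but degrades on an independent repeat of the "experiment". All data and names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "experiment": the true signal is sin(2*pi*x); the training set and
# an independent repeat of the experiment each add their own noise.
x = np.linspace(0.0, 1.0, 200)
signal = np.sin(2 * np.pi * x)
y_train = signal + rng.normal(0.0, 0.3, x.size)  # the one dataset we train on
y_val = signal + rng.normal(0.0, 0.3, x.size)    # a repeat of the experiment

def mse(pred, truth):
    return float(np.mean((pred - truth) ** 2))

# "Overtrained" model: simply memorize every training point.
memorized = y_train

# Lower-capacity model: a moving average that ignores point-to-point noise.
window = 21
smooth = np.convolve(y_train, np.ones(window) / window, mode="same")

# Memorization is perfect on the training data but markedly worse on the
# independent repeat -- the signature of overtraining.
print(f"memorized: train={mse(memorized, y_train):.3f}  val={mse(memorized, y_val):.3f}")
print(f"smoothed:  train={mse(smooth, y_train):.3f}  val={mse(smooth, y_val):.3f}")
```

The memorizing model's training error is exactly zero, while its error on the repeated experiment is dominated by the noise; the smoother model trades a little training error for much better generalization.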
 
FactChecker said:
This is the rapidly growing field of "big data": the study of how to analyse, organize, and visualize massive amounts of data. It is a problem in many fields, and there are many research efforts to address it. See https://en.wikipedia.org/wiki/Big_data.

I am familiar with the field of "big data" and am aware of the problems it raises in many fields (particularly in many areas of business). The Wikipedia article you cite does list complex physics simulations -- my question was more about specific collaborations between data scientists (whether they be computer scientists, applied mathematicians, or statisticians) and physicists to address these problems (e.g. interdisciplinary groups working on the challenges of big data in physics).
 
Vanadium 50 said:
There have been some Kaggle challenges. I think there is quite some skepticism from the HEP community on machine learning. There is the risk of overtraining; selecting the best solution has some statistical uncertainty to it; and you're never really sure if the training is finding events that have the features you want and not features that are accidentally correlated with the features you want. And there have been some culture clashes. HEP physicists have a reputation for being arrogant, but even they are taken aback by some of the data scientists.

What I found most interesting from the last Kaggle challenge wasn't that some algorithms performed marginally better than the ones developed by people with physics knowledge (rather than software-optimization expertise), but that many of the better algorithms were substantially faster than ours.

I can certainly see that the risk of overfitting is very real with respect to HEP data and needs to be carefully considered by data scientists and physicists alike (choosing an appropriate training set with the necessary features is also an issue).

The cultural clashes between HEP physicists and data scientists are interesting -- setting aside the arrogance for the moment, I wonder if the use of different language by the two groups may also contribute to the "clashes" or barriers to interdisciplinary cooperation.

I'm also interested in learning more about the Kaggle challenges related to HEP physics, as well as what types of algorithms you've found that were substantially faster. Any links would be greatly appreciated.
 
If you Google "Kaggle LHC" you'll get links to the challenges.

Overtraining is a big problem. How do you know you have done it? There is only the one real dataset. If your dataset is "everyone who shops at Macy's", you also have only the one dataset, but if odd features happen to show up in this instance that wouldn't appear if you repeated the experiment, you don't care -- they are real for this dataset. But if your code seizes on the fact that one collects Higgs bosons preferentially on Tuesdays, you're training on random fluctuations.
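To make the "Higgs bosons on Tuesdays" point concrete, here is an illustrative NumPy sketch with entirely hypothetical data: in a single simulated dataset, a pure-noise feature can show a nonzero correlation with the label by fluctuation alone, but repeating the simulated experiment averages it away -- which is exactly the check you cannot perform with the one real dataset.

```python
import numpy as np

def make_dataset(seed):
    """Simulated 'events': one genuinely informative feature and one
    pure-noise feature (think 'day of the week the event was recorded')."""
    r = np.random.default_rng(seed)
    n = 50
    label = r.integers(0, 2, n)             # signal (1) vs background (0)
    useful = label + r.normal(0.0, 1.0, n)  # really correlated with the label
    weekday = r.integers(0, 7, n)           # no real connection to the label
    return useful, weekday, label

useful, weekday, label = make_dataset(seed=42)

# In this ONE dataset the weekday/label correlation is generally nonzero,
# purely by statistical fluctuation -- and with a single real dataset there
# is no way to repeat the experiment and tell the difference.
print(f"weekday/label correlation in one dataset: {np.corrcoef(weekday, label)[0, 1]:+.3f}")

# With simulation you CAN repeat the experiment: the fluctuation averages
# away across repeats, while the genuine feature stays correlated.
corrs = []
for s in range(200):
    _, wk, lab = make_dataset(seed=s)
    corrs.append(np.corrcoef(wk, lab)[0, 1])
print(f"mean weekday/label correlation over 200 repeats: {np.mean(corrs):+.4f}")
print(f"useful-feature/label correlation: {np.corrcoef(useful, label)[0, 1]:+.3f}")
```

A model trained on the single dataset is free to exploit the weekday fluctuation; only repeated (here, simulated) experiments reveal that it carries no information.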
 
