Statistics and Data Science applied to Physics

  1. Dec 18, 2015 #1

    StatGuy2000

    User Avatar
    Education Advisor

    I wasn't sure where to post this, but I figured this would be a topic under General Physics. I am aware that the next generation of observations, ranging from cosmological observations to post-LHC particle physics experiments, will produce overwhelmingly large and complex datasets, far larger than what many physicists are accustomed to working with.

    This leads me to believe that there are potential collaborative opportunities between physicists and statisticians, applied mathematicians, and computer scientists specializing in machine learning & complex database research, whose very expertise involves the analysis of large, complex datasets.

    I was wondering if anyone here at PF is aware of such collaborative research groups. The only one I'm aware of is the astrostatistics research group at Carnegie Mellon, but perhaps there are people more in the know here. Thanks!
     
  3. Dec 18, 2015 #2

    FactChecker

    User Avatar
    Science Advisor
    Gold Member

    This is the rapidly growing field of "big data": the study of how to analyse, organize, and visualize massive amounts of data. It is a problem in many fields, and there are a lot of research efforts to address it. See https://en.wikipedia.org/wiki/Big_data.
     
  4. Dec 18, 2015 #3

    Vanadium 50

    User Avatar
    Staff Emeritus
    Science Advisor
    Education Advisor

    There have been some Kaggle challenges. I think there is considerable skepticism in the HEP community about machine learning. There is the risk of overtraining; selecting the best solution carries some statistical uncertainty; and you're never really sure whether the training is finding events that have the features you want, rather than features that are accidentally correlated with the features you want. And there have been some culture clashes. HEP physicists have a reputation for being arrogant, but even they are taken aback by some of the data scientists.

    What I found most interesting from the last Kaggle challenge wasn't that some algorithms performed marginally better than the ones developed by people with physics knowledge rather than software optimizers, but that many of the better algorithms were substantially faster than ours.
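    The overtraining risk mentioned above is easy to demonstrate on a toy dataset (a minimal sketch in plain NumPy, not from the thread): a 1-nearest-neighbour classifier fit to pure noise scores perfectly on the events it memorised, yet no better than chance on held-out events.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pure-noise "events": 2 features, random binary labels. There is
# nothing real to learn, so any apparent skill is overtraining.
X = rng.normal(size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

def knn_predict(X_train, y_train, X_query):
    """1-nearest-neighbour classifier: it simply memorises the training set."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(d, axis=1)]

train_acc = (knn_predict(X_train, y_train, X_train) == y_train).mean()
test_acc = (knn_predict(X_train, y_train, X_test) == y_test).mean()

print(f"train accuracy: {train_acc:.2f}")  # perfect on memorised noise
print(f"test accuracy:  {test_acc:.2f}")   # roughly chance on fresh noise
```

    The gap between the two accuracies is the overtraining; only the held-out score reflects real performance.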
     
  5. Dec 20, 2015 #4

    StatGuy2000

    User Avatar
    Education Advisor

    I am familiar with the field of "big data" and aware of the problems it raises in many fields (particularly in many areas of business). The Wikipedia article you cite does list complex physics simulations -- my question was more about specific collaborations between data scientists (whether computer scientists, applied mathematicians, or statisticians) and physicists to address these problems (e.g. interdisciplinary groups working on the challenges of big data in physics).
     
  6. Dec 20, 2015 #5

    StatGuy2000

    User Avatar
    Education Advisor

    I can certainly see that the risk of overfitting is very real for HEP data and needs to be carefully considered by data scientists and physicists alike (choosing an appropriate training set with the necessary features is also an issue).

    The cultural clashes between HEP physicists and data scientists are interesting -- setting aside the arrogance for the moment, I wonder if the different vocabularies of the two scientific groups may also contribute to the "clashes" or barriers to interdisciplinary cooperation.

    I'm also interested in learning more about the Kaggle challenges related to HEP physics, as well as what types of algorithms you've found that were substantially faster. Any links would be greatly appreciated.
     
  7. Dec 20, 2015 #6

    Vanadium 50

    User Avatar
    Staff Emeritus
    Science Advisor
    Education Advisor

    If you Google "Kaggle LHC" you'll get links to the challenges.

    Overtraining is a big problem. How do you know you have done it? There is only the one real dataset. If your dataset is "everyone who shops at Macy's", you also have only the one dataset, but if odd features happen to show up in this instance that wouldn't appear if you repeated the experiment, you don't care: they are real for this dataset. But if your code seizes on the fact that one collects Higgs bosons preferentially on Tuesdays, you're training on random fluctuations.
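    That "Higgs bosons on Tuesdays" trap can be illustrated with a quick simulation (a hypothetical sketch, not from the thread): scan many pure-noise features for the one most correlated with the labels in a single dataset, then check that same feature against an independent "repeat of the experiment". The apparent signal evaporates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_events, n_features = 1000, 1000

def feature_label_corr(X, y):
    """Pearson correlation of each feature column with the labels."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

# One "real" dataset: features and labels are all independent noise.
X_a = rng.integers(0, 2, size=(n_events, n_features)).astype(float)
y_a = rng.integers(0, 2, size=n_events).astype(float)

corr_a = feature_label_corr(X_a, y_a)
best = int(np.argmax(np.abs(corr_a)))  # the "Higgs on Tuesdays" feature

# "Repeat the experiment": a fresh, independent dataset.
X_b = rng.integers(0, 2, size=(n_events, n_features)).astype(float)
y_b = rng.integers(0, 2, size=n_events).astype(float)
corr_b = feature_label_corr(X_b, y_b)

print(f"best feature in dataset A: |r| = {abs(corr_a[best]):.3f}")
print(f"same feature in dataset B: |r| = {abs(corr_b[best]):.3f}")
```

    Searching enough random features guarantees one looks predictive in the single dataset you have; only a repeat of the experiment reveals it as a fluctuation.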
     