
Other Is BigData/MachineLearning/DeepLearning really the future?

  1. Jun 9, 2016 #1
    Browsing through the internet, I keep hearing of these areas as if they hold so much promise. But it could be that the internet is exaggerating. Do you think humanity will see great advances in these areas and will come to depend largely on them? Why? Anything interesting you can tell me about these areas?

    I'm thinking about doing a PhD in this area, so I wanted to know what you guys thought about it.

  2. jcsd
  3. Jun 10, 2016 #2
    Follow your muse, not those of the pundits. Seriously, what do YOU want to do? What intrigues you?

    Big Data is just data. Does merging databases intrigue you? It doesn't intrigue me. Yes, they discover lots of correlations. What most of these data geeks fail to note is that the data contain artifacts due to the collection methods. Many will not realize this, and follow those data collection artifacts right into the weeds instead of talking to someone to figure out what is going on. You REALLY need to know a lot of background about the data (how it was gathered, why it was gathered, and what it was used for) before you start your analysis. Internet people make it out to be like discovering pearls of wisdom with a few short SQL queries. It ain't like that.

    If I knew what the next big thing was going to be in your lifetime, I'd tell you. But I don't know. I used to think that some day I might actually have the opportunity to live on the moon. That hasn't happened yet and at the rate things are going I doubt I'll live to see it.

    The truth is that people make their own futures. So if you know what you like, study it. If you don't know what you like, figure that out first. Don't follow pundits: they cannot know what your life will be like. Even you can't know what it will be like. So find something that amuses you and have fun with it.
  4. Jun 10, 2016 #3


    Science Advisor
    Gold Member

    Big data is certainly a rapidly growing field and there is an increasing amount of academic research surrounding it.
    The same is true for machine learning, although I suspect most of the "market" is not in AI or similar areas that get a lot of press, but rather in mundane areas such as image recognition for industry, curve fitting/modelling of complex systems, etc.
  5. Jun 10, 2016 #4


    Education Advisor

    Jake, I should note that statisticians, who are really the original "data scientists" (or "data geeks", as you would rather call them, given that analyzing and making sense of data is a key part of the mission of statistics), have been making the very points you raise for years. The "collection method" you talk about, along with issues with the experimental design, data quality, etc., are all critical components of the education of statisticians. When people talk about "lies, damned lies, and statistics", the problem isn't with statisticians, but with people who have no background, training, or understanding in statistics analyzing data and ignoring the issues you raise. Many (but by no means all) of the Internet people you speak of above are guilty of this.

    I'm raising this because your post above indicates a fair degree of contempt towards statisticians or statistics, which is unwarranted.
  6. Jun 10, 2016 #5
    My apologies; real statisticians and real data analytics can be a good thing. However, they're expensive, and the ones who know their craft are not that common. My experience hasn't been that pleasant. Perhaps it's because the people perpetrating these studies don't have the education or the experience, while their bosses have been reading too many glossy trade magazines and thinking that this is easy (which I think we both agree it is not).

    We often get people trying to massage the data that we produce from our control systems. They see this fountain of delicious high-fructose data, and they think there must be nuggets of information in there that nobody has ever considered. But do they talk to the plant engineers? No. Do they talk to the senior operators? No. Do they talk to the process engineering staff who configured the data gathering software? No again! Then they produce these "profound" discoveries and usually tell us what we already knew; or they tell us of "discoveries" that are accurate, but don't take the whole picture of how the process is managed into account.

    Big Data is the kind of endeavor where you have to immerse yourself in the process, and the flow of the data. People who can follow that well are not commonplace. Many make this out to be managing large databases, or "mining data." The problem is that there are MANY posers and pretenders out there (either that or they tend to gravitate toward my business), but damned few who really know the craft.
  7. Jun 10, 2016 #6


    Education Advisor

    As I see it, part of the issue is that those who are gravitating toward data science positions, like the people who are trying to "massage" your data (some of whom may have graduate studies in statistics, but many also coming from computer science, physics, or other quantitative disciplines) aren't trained in working with domain experts like the engineers you work with to understand the source of the data and the underlying process involved. Statisticians are guilty of this at times as well, but in my graduate program, my applied stats professor (who also was leading the statistical consulting centre) emphasized to me how important it was to communicate and work with domain experts, and this was an important lesson that I've always taken during my professional career as a statistician.

    My very first job out of grad school involved working with engineers analyzing data from experiments to test the effectiveness of G-suits in attenuating the effects of the G-forces that military pilots experience. I made it my business to understand the experiments, talking to the engineers, the military staff, and others, to understand the source of the data. Now that I work in the pharma/biotech sector, I and other statisticians work closely with physicians, pharmacists, biochemists, and other folks to truly understand the problem at hand.

    I should also point out that the problem isn't restricted solely to statisticians or data scientists. Oftentimes, what I've also observed is that statisticians or data scientists are only called in once all of the data have been collected, and are asked to provide their expertise without being given the type of background information you speak of, and without the ability to shape the decision making on how that background information is gathered. In that case, the best statisticians can do is work with the data at hand, no matter how badly it was collected. There is a saying among statisticians: "Garbage in, garbage out." By this we mean that you can apply the fanciest methodologies imaginable, but if the source data is flawed (due to issues in experimental design, collection, data quality, etc.), then any subsequent results of the analyses will be equally flawed.

    The takeaway here is that for statisticians to be most effective in their jobs, it is critical that they be brought in from the very outset, when the data is being collected, and ideally be involved in the decision making on how experiments are designed. And it is critical for statisticians to work collaboratively with others to do their jobs effectively.
  8. Jun 10, 2016 #7


    Science Advisor
    Education Advisor

    "Informatics" - it's like one of those rock 'n roll words that doesn't really mean anything - like "ramalamadingdong," or "givepeaceachance."
  9. Jun 11, 2016 #8
    Well, big data is just a fact of life now. It is different from the other two things you mention, in that it is something that exists, not a tool that's used.

    Many machine learning techniques are 30 - 40 years old. Modern computational power combined with more data has allowed it to flourish. I wouldn't say it's the future, I'd say it's the present, and will be used a great deal going forward. Deep learning is a branch of machine learning.

    This article is a good introduction to the difference in philosophy and culture between machine learning and statistics.
  10. Jun 11, 2016 #9
    I realize statguy addressed this, but it deserves re-emphasizing:

    Merging databases doesn't discover correlations. Also, statisticians and data scientists are not intrigued by merging databases. It's just something you do to build a useful data set. The intriguing part comes next.
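    To make that point concrete with a toy sketch (all table names and values here are invented for illustration, using Python's built-in sqlite3): the merge itself is just a join that lines columns up into an analysis-ready data set; nothing about it "discovers" a correlation — that work comes after.

    ```python
    import sqlite3

    # Hypothetical example: join two small tables to build an analysis-ready
    # data set. The join discovers nothing; it only aligns the columns.
    def build_dataset():
        conn = sqlite3.connect(":memory:")
        cur = conn.cursor()
        cur.execute("CREATE TABLE sensors (plant_id INTEGER, flow REAL)")
        cur.execute("CREATE TABLE plants (plant_id INTEGER, region TEXT)")
        cur.executemany("INSERT INTO sensors VALUES (?, ?)",
                        [(1, 10.5), (2, 7.2), (1, 11.0)])
        cur.executemany("INSERT INTO plants VALUES (?, ?)",
                        [(1, "north"), (2, "south")])
        cur.execute("""SELECT s.plant_id, s.flow, p.region
                       FROM sensors s JOIN plants p ON s.plant_id = p.plant_id""")
        rows = cur.fetchall()
        conn.close()
        return rows  # one (plant_id, flow, region) row per sensor reading
    ```

    Any actual analysis (correlations, models, quality checks) operates on the rows this returns; the join is just plumbing.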

    This is spectacularly untrue. I've been in the data analytics business (in one form or another) for more than 7.5 years now, first as an actuary and now as a technical manager in a research department. I've worked with many people at three different large companies (two Fortune 100), and I cannot emphasize enough how absolutely untrue this is.

    The analytics engine I'm working on now depends on data. I spend ~50% of my time in meetings about the data. I'm closely connected with the people who produce the data. We have extensive code that tests data quality and performs removals/imputations. This is not unusual, and my colleagues do the same.

    There are exceptions to every rule, and the rule is that statisticians and data scientists are constantly thinking about the quality of their data.
  11. Jun 11, 2016 #10
    Sorry guys. There ain't enough beer in the world to drink while discussing all the war stories and battle scars I have from incredibly shoddy analysis of the data we gather and post on our processes.

    There are certain fields of study and practice that tend to attract pretenders. Antenna engineering is one example. It is complex. It is hard to measure the performance accurately, and the need for such services is relatively infrequent. As such it is very difficult to tell who knows the business and who doesn't. That's the kind of problem that you get with data analysis consultants too.

    Furthermore, it is hard for an un-sexy business such as water and waste-water utilities to attract the best and brightest. Often we encounter people with a very misguided environmental policy they want to foist upon us. It wouldn't surprise me to learn that we are a magnet for some of the worst of the data analytics business. We often see people who have a conclusion all written out and they're just looking for data to cherry-pick to prove their point. I tire of the pretenders, posturing, and stupidity of the people who take data at face value and arrive at ridiculous conclusions.

    Yes, these are sour grapes. I'm sorry to be the bearer of news that there are imposters out there. From my anecdotal experience these numbers aren't small. This is not to devalue the services of those who actually DO know what they're doing. Unfortunately, from a high level manager's point of view, it's damned hard to tell the difference.
  12. Jun 11, 2016 #11
    And if your post had said "Data analytics hasn't done much for me!" then you'd have had a much different reaction. That's because it's your experience, and frankly management's poor understanding of what data analytics can and can't do is probably something we could all have a good laugh about.

    But that's not what you said. Instead, along with some really confused notes on data merging and SQL queries, you made broad generalizations about a pool of varying careers that spans many industries.

    I shouldn't have to say this, but it's entirely possible that your small sample of experiences (and it is a small sample) could be totally true, and yet still not be an appropriate generalization. One way to find out would be to consult people who actually know something about this.

    So I'd point out to you and other readers that there is an Actuarial Standard of Practice devoted to data quality. Literally any time a credentialed actuary gives anyone an opinion they are expected to have considered issues related to data, including data consistency, reasonableness and its limitations. That's just one example, but it's a broad one that impacts tens of thousands of professionals working every day in the US.

    Between statguy and myself, we could come up with many, many more. There are entire seminars and meetings devoted to data collection and wrangling. Vast troves of publications and guidelines. Entire departments at companies.

    So you have a point of view and I'm glad you've had the chance to communicate it. But as someone who spends large amounts of time doing exactly what you say I don't do, and as someone who knows large numbers of other people who work very hard doing exactly the thing you say they don't do, you can expect me to try and ensure other readers hear my point of view as well.
  13. Jun 11, 2016 #12

    Vanadium 50

    Staff Emeritus
    Science Advisor
    Education Advisor
    2017 Award

    Jake, I usually like reading what you write, but I think you missed the mark here. If I argued that everybody in the water utilities business was an incompetent stumblebum - just look at Flint, Michigan - you would charge me with overgeneralization, and rightfully so. Is this really any different?
  14. Jun 13, 2016 #13


    Education Advisor

    I'm not sure who else on this forum is a data scientist, but I am one. Just to give you some of my background: I double majored in statistics and computer science, and I have a master's degree in statistics. I worked initially building financial models, then transferred to bioinformatics, and finally became a data scientist at a large company.

    I think doing a PhD in big data is a bit needless. There's merit in studying machine learning, AI, statistical learning, etc., but studying "Big Data" itself seems odd, especially for a PhD. There's a difference between studying machine learning and then focusing on how to apply those techniques to data sets that are terabytes in size, versus learning about how to work with large data sets. The fact of the matter is that a lot of the "new" techniques in big data are just old techniques that we can finally use a computer to solve. A lot of current research deals with ways to handle class imbalance and with subsampling to improve performance. By the time you're ready for a PhD, who knows what current research will focus on? The more important part is that you're interested in the field, not the buzz.
  15. Jun 13, 2016 #14


    Science Advisor
    Gold Member

    Much of this is probably true. However, there is certainly a push for more academic research in this area. The research institute where I work is currently (together with various universities here in the UK) investing money in this area, apparently because there is demand from industry and various parts of academia for more research. My understanding is that "big data" here refers to very large data sets (many terabytes) from various sources; prime examples in science would be data from the LHC at CERN and the SKA. Traceability, and more generally metrology, for such large data sets is (I am told) far from trivial.

    Edit: a quick Google search gives some examples of the above from NIST.

  16. Jun 13, 2016 #15


    Education Advisor
    Gold Member

    I have to agree that my experience is much more aligned with Jake's. I have been in several situations where statistics were misinterpreted and certainly misapplied by so-called experts. Yes, I admit there are experts too, and I also feel it IS an extremely important field that should be a mandatory class in high school as well as a requirement for a BS in college. It just seems that everybody who has some (meager) knowledge of stats has a high propensity for abusing the numbers, so to speak. And these charlatans often market their ineptitude because they are salesmen first, technically adept second (or third, or last!). Sadly, I have even seen engineers who evolved into sales and almost purposely gathered data that would show their products in glowing terms, but overlooked the actual results when compared to existing conditions. Or the famous trick of cutting off the origin and showing a 5-10 point difference in the data, when the actual values are 1024 vs. 1036; their reasoning is that they don't have the space to show the first 1000 points. I see this crap ALL of the time, usually in sales literature.
    I feel I can smell baloney and cooked numbers as well as anyone, as I am both a PE in EE and a former Certified Reliability Engineer from the American Society for Quality (the same outfit that certifies Six Sigma Black Belts), with much of my own QC training directly from Motorola, the originators of the Six Sigma mythology currently used by ASQ (and several other misnomers).
  17. Jun 13, 2016 #16


    Education Advisor

    I hope you don't get the impression that I don't think there's research dealing with big-data-specific issues. I myself work on a stream of data that amounts to 14 terabytes of raw unprocessed information per day. I understand very well that researchers devote a considerable amount of time to solving problems related to the sheer magnitude of the data. However, I have mostly seen this through the lens of a researcher with a background in statistical learning, machine learning, algorithms, or a myriad of other fields, taking their specialized knowledge and finding clever new ways to apply known algorithms/techniques efficiently to large data. What I'm mostly trying to get at is that I haven't personally encountered anyone whose focus is just "big data". I've encountered people whose focus is statistical learning with big data, or parallel processing with big data. I think that's great.

    Will there eventually be a field called just "big data"? Maybe. However, I think of big data as a space within a larger context. Just as number theory is part of math, big data is part of many different fields. One doesn't get a degree in number theory; one gets a degree in math and focuses on number theory. Similarly, I feel one should get a degree in CS/stats/whatever and focus that knowledge on big data, if that's what interests you.
  18. Jun 13, 2016 #17
    It's an exciting "field" for a reason. However, do bear in mind that it seems to be in the hype phase, and be extra skeptical of what you decide to do.

    During the hype phase, people will take Exciting New Idea and apply it to every Old Unsolved Problem. For instance, the application of big data to medicine is a source of much excitement. Judging from the seminars and papers on genomics I've seen, this excitement is probably woefully misplaced.

    EDIT: I should point out that I don't mean to say that "big data" approaches won't help medical science, I simply mean not every problem you couldn't solve in a small data context will magically be solved by having more data, for obvious reasons.

    For instance, clustering is a very common data analysis problem. If you have a good method for deciding if two points are similar, then such techniques can be informative as the scale of your data increases. However if your method for determining whether two points are similar is nonsense, as I suspect is often the case, all more data does is provide you with a more elaborate Rorschach test.
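    A toy sketch of that last point (the points, centroids, and metrics below are entirely made up): the same four points, assigned to the nearest of two centroids, under a sensible Euclidean-style metric versus a nonsense metric that compares only the second coordinate. The good metric separates the two tight groups; the bad one lumps everything together.

    ```python
    # Nearest-centroid assignment: label each point with the index of the
    # centroid that minimizes the given distance function.
    def assign(points, centroids, dist):
        return [min(range(len(centroids)),
                    key=lambda k: dist(p, centroids[k])) for p in points]

    euclidean = lambda a, b: (a[0]-b[0])**2 + (a[1]-b[1])**2
    nonsense  = lambda a, b: (a[1]-b[1])**2   # ignores the first coordinate

    # Two tight groups near x=0 and x=10, all at roughly the same y.
    points = [(0.0, 5.0), (0.1, 5.1), (10.0, 5.0), (10.1, 4.9)]
    centroids = [(0.0, 5.0), (10.0, 5.0)]
    ```

    With `euclidean` the assignment is [0, 0, 1, 1]; with `nonsense` every point looks equally close to both centroids, and more data would never fix that — it would only make the Rorschach test more elaborate.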
    Last edited: Jun 13, 2016
  19. Jun 14, 2016 #18


    Education Advisor

    CalcNerd, the problem that both you and Jake see is that many of those who market themselves as data science "experts" do not in fact have adequate training in statistics at all. Very few, if any, are actual statisticians -- many data scientists are in fact computer scientists or physicists by training (sorry, but I just had to pick on both the CS and physics people!). And one of the major gripes statisticians have is with those who lack the knowledge or expertise passing themselves off as "experts", abusing or misapplying basic statistical principles when analyzing and presenting data.

    I understand that it isn't always easy to see who is the real deal and who is the charlatan when it comes to who will analyze your data. My own take is that the more modest the claim that data science experts make in terms of the value they bring to your organization, the more trustworthy the actual value of the results of their analyses are.
  20. Jun 14, 2016 #19
    Measurement errors coming from the collection method itself are absolutely a thing.
  21. Jun 14, 2016 #20


    Education Advisor

    I'm 100% sure that Locrian isn't referencing the fact that data contains noise, but rather disputing the point that we're unaware of the presence of noise.
  22. Jun 14, 2016 #21
    Re-reading, I see what he meant; my mistake.
  23. Jun 15, 2016 #22
    Are you fascinated with problem solving? Does "np-complete" get you excited? Computer science is much more than searching data sets for correlations and biases. Yes, there are scientific questions that traditional experimentation alone is unlikely to solve, and advances in computing technology and techniques could provide answers within your lifetime.
  25. Jun 17, 2016 #24
    Since the thread addresses machine learning specifically, let me throw this out there:

    Most machine learning algorithms contain no statistical models at all. I joke with people that machine learning is just statistics without any statistics.

    The computational side of statistics (as opposed to study design) is ultimately a problem of optimization: using your sample data, can you identify the parameter estimates that will most closely represent new data? For normally distributed errors, this maximum likelihood problem reduces to minimizing the sum of squared errors, which is trivial. For non-normal distributions the problem is not always trivial, and those who have dealt with complicated mixed models know that convergence problems really are a thing.
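    A minimal worked example of that equivalence (toy numbers, plain Python): for a straight-line model with normally distributed errors, maximizing the likelihood is the same optimization as minimizing the sum of squared residuals, and the solution is available in closed form.

    ```python
    # Closed-form least squares for y = slope*x + intercept.
    # For Gaussian errors this is exactly the maximum-likelihood fit.
    def least_squares(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        return slope, my - slope * mx

    # Made-up data lying exactly on y = 2x + 1.
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    ```

    On noise-free data the fit recovers the slope 2 and intercept 1 exactly; with real data, the same formula returns the Gaussian-MLE estimates.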

    But the statistical assumptions going into the process are just that – assumptions. For more complicated systems that aren’t carefully controlled (meaning, just about anything that happens in life outside a lab setting), the assumptions are going to be an approximation. Often, they’re good approximations, and it’s always amazed me how often simple linear approximations work well.

    But they’re still approximations. Machine learning algorithms do away with that, and simply optimize forecasts using a process. There’s still a constraint inherent in tool choice, but for many complicated processes machine learning algorithms perform vastly better than traditional statistical tools such as logistic regressions.

    However, there’s a price to pay: less interpretability. The parameter estimates that come out of statistical analysis provide a reasoning that users can (sometimes) use to interpret the result. If someone asks me why the output of a linear regression changed, I should be able to trace that back to the input data and tell them why the model believes the forecast should be different. No such meaningful process exists for bagged trees, and for things like support vector machines I would argue such tracebacks aren’t enlightening.
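    To sketch what such a traceback looks like for a linear model (the coefficients and feature names below are invented for illustration): the change in forecast decomposes exactly into per-feature contributions, coefficient times input delta, which is the explanation a bagged tree ensemble cannot hand you.

    ```python
    # Hypothetical linear model: prediction = intercept + sum(coef * feature).
    coefs = {"age": 0.5, "income": 2.0, "tenure": -1.0}

    def predict(x, intercept=10.0):
        return intercept + sum(coefs[f] * x[f] for f in coefs)

    def explain_change(x_old, x_new):
        # Each feature's contribution to the change in the forecast.
        return {f: coefs[f] * (x_new[f] - x_old[f]) for f in coefs}

    x_old = {"age": 40, "income": 3.0, "tenure": 5}
    x_new = {"age": 41, "income": 3.5, "tenure": 5}
    ```

    The contributions sum exactly to the total change in the prediction, so "why did the forecast move?" has a complete, feature-by-feature answer.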

    There are times when we can have both accuracy and interpretability, but often it’s a tradeoff. I’ve made it sound like statistics is on the interpretable side, but it’s really more complicated than that. Adding multiple interactions between different variables can improve forecast accuracy, but interpretation becomes increasingly difficult, even if possible.

    It's all more complicated than this (and I'm sure others will have lots of additions and corrections), but I think this gives a taste of how machine learning differs from traditional statistics. Increasingly, I think, the two aren't seen as different areas, and I sometimes hear the combined field called "statistical learning", as in Elements of Statistical Learning.
  26. Jun 18, 2016 #25


    Science Advisor

    Why don't you consider, e.g., an artificial neural network as just a statistical model with many parameters?

    Eg. "Logistic regression is a special case of the MLP with no hidden layer (the input is directly connected to the output) and the cross-entropy (sigmoid output) or negative log-likelihood (softmax output) loss. It corresponds to a probabilistic linear classifier and the training criterion is convex in terms of the parameters (which guarantees that there is only one minimum, which is global)." http://www.iro.umontreal.ca/~bengioy/ift6266/H12/html.old/mlp_en.html
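    That quoted description can be sketched directly in a few lines (toy one-dimensional data and plain per-sample gradient descent, all invented for illustration): a logistic regression is a network with no hidden layer, a sigmoid output, and the cross-entropy loss, whose gradient has the simple form (p - y) times the input.

    ```python
    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def train(data, epochs=500, lr=0.5):
        # One weight and one bias: input connected directly to the output.
        w, b = 0.0, 0.0
        for _ in range(epochs):
            for x, y in data:
                p = sigmoid(w * x + b)
                # Gradient of cross-entropy wrt (w, b) is (p - y) * (x, 1).
                w -= lr * (p - y) * x
                b -= lr * (p - y)
        return w, b

    # Linearly separable toy data: negatives on the left, positives on the right.
    data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
    ```

    Because the cross-entropy criterion is convex in (w, b) for this model, there is a single global minimum, which is the property the quoted passage highlights; adding a hidden layer is what gives up that guarantee.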