Data Mining

  1. Dec 13, 2006 #1
    Hi all,

    A while back I set out to learn Calculus, but I quickly lost interest. I realize now that my problem was my failure to set a clear goal. I was just reading through chapters of online textbooks and doing exercises, but I never felt like I was getting any closer to anything. Sure, I was "learning Calculus", but there's really no end to Calculus, or math in general for that matter. You can "learn" it forever, all day just sitting there reading textbooks.

    So I asked myself why I felt the need to learn Calculus, and the answer is that I feel like I am not equipped to analyze the world at large adequately. I constantly wonder about the causes of things, particularly in the context of large groups of people. When I realized this, I realized that the real reason that I need to learn Calculus is so that I can use statistical methods to analyze data about people.

    This brings us to the subject of this thread: data mining. I'll admit up front that I haven't "done my research" regarding the topic--that is, I haven't scoured the web for resources on it before resorting to asking for information on a forum. However, I think that such an approach is backwards, anyway; if you are interested in learning about a topic, it is much, much more efficient to find an expert in that topic and ask him to point you in the right direction than it is to wander around blindly, reading this article and that book on the subject at random. For instance, I have played and studied chess for several years, and I know for a fact that someone would be better off asking me where to start in terms of learning the game than they would be wandering the internet reading articles and ebooks on chess.

    Anyway, with that disclaimer aside, can anyone with experience in the field of data mining or statistical analysis of data "point me in the right direction"? In my own opinion, the best course of action would be to learn differential and integral calculus by doing mostly word problems (I am a huge believer in word problems when it comes to really learning math), and then to do the same for statistical analysis.

    Can anyone modify or flesh out this plan for me (preferably with useful links :D)?

    Thank you for your time.
  3. Dec 13, 2006 #2
    Science Advisor
    Homework Helper

  4. Dec 21, 2006 #3
    Science Advisor

    Data mining is just a fancy, marketing-generated name for (statistical) analysis. For centuries, going back at least to Tycho Brahe, Copernicus, Kepler, and on and on, people have been doing data mining. Planck's invention of the quantum was based on data mining.

    In more modern times, data mining is primarily focused on finding patterns in data. Finding patterns in the spending of credit card users, or temporal patterns in renewals of magazine subscriptions, are typical business data-mining problems. The credit card work aimed to find patterns that might be used to better target the marketing of automobiles. The subscription work was oriented toward finding optimal times for mailing subscription renewal notices. I've been involved in these and many other similar problems for almost 40 years. Most of us who were doing this kind of work in the 1970s and on never knew we were doing data mining; we called it statistical analysis or just plain analysis.

    The credit card problem, with a sample of over a million card holders extending over three years, required factor analysis, some regression, and ultimately cluster analysis. We started with neural networks, but they didn't work well. This problem took three PhDs almost a year to solve. The plain fact was that most people tended to exhibit similar spending patterns -- we finally used some non-linear transformations to tease out distinct patterns. The subscription problem was solved using various forms of regression - LMS and logistic.
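
    For readers who want to see the general shape of "transform, then cluster", here is a minimal sketch. It is not the method used on the credit card project: the log transform, the plain k-means routine, the function names, and the spending figures are all assumptions made up for illustration.

```python
import math


def log_transform(rows):
    """Apply log(1 + x) to every value (an assumed non-linear transform).

    This compresses large amounts so they do not dominate distances.
    """
    return [[math.log1p(v) for v in row] for row in rows]


def kmeans(points, centroids, iterations=20):
    """Plain k-means (Lloyd's algorithm) with caller-supplied start centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical monthly spending per card holder: (groceries, travel).
spending = [
    (200, 10), (220, 15), (180, 12),        # mostly groceries
    (210, 900), (190, 1100), (230, 1000),   # groceries plus heavy travel
]

features = log_transform(spending)
# Start from one card holder of each kind so the run is deterministic.
labels, centers = kmeans(features, centroids=[features[0], features[3]])
print(labels)
```

    On real data the hard part is everything this sketch skips: choosing the features, choosing the transform, and deciding how many clusters are meaningful.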

    To do this kind of work you need a sophisticated knowledge of math, preferably through differential equations, a strong level of comfort with statistics, including time series analysis, and the ability to turn business problems into quantitative form -- not always easy. If it were not for clients, the work would be simple.

    The best data analysts I've ever encountered were trained as either physicists or engineers -- solving problems with math is central to these disciplines.

    The best route to success: solid knowledge of undergraduate math and statistics, some physics or engineering, and a business course or two. Then, often the best way to learn is to find a mentor -- a boss, a friend, whatever. Ultimately the kind of work that goes with data mining requires a strong intuition -- the real problems are seldom similar to the academic course problems.

    On your own, start with a book(s) on business statistics; keep it practical. You do need calculus to get a good grounding in statistics, but learn it as needed.


    Reilly Atkinson
    Last edited: Dec 21, 2006