What is the Best Way to Get Started in Data Mining?

  • Thread starter Alternamaton
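In summary, the original poster proposes learning differential and integral calculus, and then statistical analysis, mostly through word problems. The expert who replies recommends a solid knowledge of undergraduate math and statistics, some physics or engineering, and a business course or two, and advises starting with a practical book on business statistics and learning calculus as needed.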
  • #1
Alternamaton
Hi all,

A while back I set out to learn Calculus, but I quickly lost interest. I realize now that my problem was my failure to set a clear goal. I was just reading through chapters of online textbooks and doing exercises, but I never felt like I was getting any closer to anything. Sure, I was "learning Calculus", but there's really no end to Calculus, or math in general for that matter. You can "learn" it forever, all day just sitting there reading textbooks.

So I asked myself why I felt the need to learn Calculus, and the answer is that I feel like I am not equipped to analyze the world at large adequately. I constantly wonder about the causes of things, particularly in the context of large groups of people. When I realized this, I realized that the real reason that I need to learn Calculus is so that I can use statistical methods to analyze data about people.

This brings us to the subject of this thread: data mining. I'll admit up front that I haven't "done my research" regarding the topic--that is, I haven't scoured the web for resources on it before resorting to asking for information on a forum. However, I think that such an approach is backwards, anyway; if you are interested in learning about a topic, it is much, much more efficient to find an expert in that topic and ask him to point you in the right direction than it is to wander around blindly, reading this article and that book on the subject at random. For instance, I have played and studied chess for several years, and I know for a fact that someone would be better off asking me where to start in terms of learning the game than they would be wandering the internet reading articles and ebooks on chess.

Anyway, with that disclaimer aside, can anyone with experience in the field of data mining or statistical analysis of data "point me in the right direction"? In my own opinion, the best course of action would be to learn differential and integral calculus by doing mostly word problems (I am a huge believer in word problems when it comes to really learning math), and then to do the same for statistical analysis.

Can anyone modify or flesh out this plan for me (preferably with useful links? :D)?

Thank you for your time.
 
  • #3
Data mining is just a fancy, marketing-generated name for (statistical) analysis. For centuries, going back at least to Tycho Brahe, Copernicus, Kepler, and on and on, people have been doing data mining. Planck's invention of the quantum was based on data mining.

In more modern times, data mining is primarily focussed on finding patterns in data. Finding patterns in the spending of credit card users, or temporal patterns in renewals of magazine subscriptions, are typical business data-mining problems. The credit card work was to find patterns that might be used to better target automobile marketing. The subscription work was oriented toward finding optimal times for mailing subscription renewal notices. I've been involved in these and many other similar problems for almost 40 years. Most of us who were doing this kind of work in the 1970s and on never knew we were doing data mining; we called it statistical analysis or just plain analysis.

The credit card problem, with a sample size of over a million card holders extending over three years, required factor analysis, some regression, and ultimately cluster analysis. We started with neural networks, but they didn't work well. This problem took three PhDs almost a year to solve. The plain fact was that most people tended to exhibit similar spending patterns -- we finally used some non-linear transformations to tease out distinct patterns. The subscription problem was solved using various forms of regression - LMS and logistic.
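For readers who have not met logistic regression, here is a minimal sketch of the idea in plain Python. It is not the model used in the subscription work, which the post does not describe. The feature (weeks between mailing a notice and the expiry date), the coefficients `TRUE_W` and `TRUE_B`, and all of the data are made up purely for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, epochs=3000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log-loss w.r.t. (w*x + b)
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical, synthetic data: x = weeks between mailing and expiry,
# y = 1 if the subscriber renewed. The "true" coefficients are invented.
TRUE_W, TRUE_B = -0.5, 2.0
rng = random.Random(0)
weeks = [rng.uniform(0, 12) for _ in range(300)]
renewed = [1 if rng.random() < sigmoid(TRUE_W * x + TRUE_B) else 0 for x in weeks]

w, b = fit_logistic(weeks, renewed)
p_early = sigmoid(w * 1 + b)   # estimated renewal probability at 1 week
p_late = sigmoid(w * 10 + b)   # estimated renewal probability at 10 weeks
print(f"w={w:.2f}, b={b:.2f}, p(1 week)={p_early:.2f}, p(10 weeks)={p_late:.2f}")
```

The fitted curve turns a yes/no outcome into a probability as a function of timing, which is the kind of quantity one would compare across candidate mailing dates. In real work you would use a statistics package rather than a hand-written loop.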

To do this kind of work you need a sophisticated knowledge of math, preferably through differential equations, a strong level of comfort with statistics, including time series analysis, and the ability to turn business problems into quantitative form -- not always easy. If it were not for clients, the work would be simple.

The best data analysts I've ever encountered were trained as either physicists or engineers -- solving problems with math is central to these disciplines.

The best route to success: solid knowledge of undergraduate math and statistics, some physics or engineering, and a business course or two. Then, often the best way to learn is to find a mentor -- a boss, a friend, whatever. Ultimately the kind of work that goes with data mining requires a strong intuition -- the real problems are seldom similar to the academic course problems.

On your own, start with a book(s) on business statistics; keep it practical. You do need calculus to get a good grounding in statistics, but learn it as needed.

Regards,

Reilly Atkinson
 

1. What is data mining?

Data mining is the process of extracting useful patterns and insights from large sets of data. It involves using various techniques, such as machine learning and statistical analysis, to identify correlations, trends, and anomalies within the data.

2. What are the benefits of data mining?

Data mining can provide valuable insights and knowledge that can help businesses make informed decisions. It can also improve efficiency and productivity by automating repetitive tasks and identifying areas for improvement. Additionally, data mining can help identify potential risks and opportunities, leading to better strategic planning.

3. What are the steps involved in data mining?

The data mining process typically involves the following steps: problem definition, data collection, data preparation, data exploration, modeling, evaluation, and deployment. These steps may vary depending on the specific techniques and tools used.
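As a rough illustration of how these steps fit together, here is a toy end-to-end sketch in plain Python. Everything in it is an assumption made for the example: the data are synthetic, the function names (`collect`, `prepare`, `explore`, `fit_line`, `evaluate`) are hypothetical, and the model is a simple least-squares line. Problem definition and deployment are not code steps here; the "problem" is simply to predict y from x.

```python
import random
import statistics


def collect(n=200, seed=1):
    """Data collection: synthetic (x, y) pairs, with a few missing y values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 3.0 * x + 5.0 + rng.gauss(0, 1)  # invented relationship plus noise
        if rng.random() < 0.05:
            y = None  # simulate a missing measurement
        rows.append((x, y))
    return rows


def prepare(rows):
    """Data preparation: drop rows with a missing value."""
    return [(x, y) for x, y in rows if y is not None]


def explore(rows):
    """Data exploration: basic summary statistics of the target."""
    ys = [y for _, y in rows]
    return {"n": len(rows), "mean_y": statistics.mean(ys), "stdev_y": statistics.stdev(ys)}


def fit_line(rows):
    """Modeling: ordinary least squares for y = slope * x + intercept."""
    mx = statistics.mean(x for x, _ in rows)
    my = statistics.mean(y for _, y in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, my - slope * mx


def evaluate(model, rows):
    """Evaluation: R^2 of the fitted line on rows not used for fitting."""
    slope, intercept = model
    my = statistics.mean(y for _, y in rows)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in rows)
    ss_tot = sum((y - my) ** 2 for _, y in rows)
    return 1.0 - ss_res / ss_tot


rows = prepare(collect())
split = int(0.8 * len(rows))          # hold out the last 20% for evaluation
train, test = rows[:split], rows[split:]
summary = explore(train)
model = fit_line(train)
r2 = evaluate(model, test)
print(summary, model, r2)
```

Holding out part of the data for evaluation is the one design choice worth noting: a model should be judged on rows it did not see during fitting.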

4. What are some common techniques used in data mining?

Some common techniques used in data mining include classification, clustering, regression, association rule mining, and anomaly detection. These techniques are used to identify patterns and relationships within the data and can be applied to various types of data, such as structured, unstructured, and semi-structured data.
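To make one of these techniques concrete, here is a minimal sketch of clustering with k-means (Lloyd's algorithm) in plain Python. The two-dimensional points are synthetic, and the farthest-first initialization is a choice made for this example so that the result is deterministic; it is not the only way, or necessarily the best way, to start k-means.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=50):
    """Lloyd's algorithm: alternate assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, labels


# Synthetic data: two well-separated groups of 50 points each.
rng = random.Random(42)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
blob_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
points = blob_a + blob_b

centers, labels = kmeans(points, k=2)
print(sorted(centers))
```

Real data are rarely this cleanly separated, so in practice much of the effort goes into choosing and transforming the features before any clustering is run.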

5. What are the ethical considerations of data mining?

As with any technology, there are ethical considerations to be aware of when using data mining. These include issues of privacy, bias, and the potential for misuse of data. It is important for data miners to adhere to ethical guidelines and ensure that the data being used is obtained and used ethically and responsibly.
