Clustering and Classification in Machine Learning

In summary, supervised learning models are used to identify and classify objects using labelled examples, while unsupervised learning models are used to find patterns in data without pre-defined answers.
  • #1
fog37
TL;DR Summary
Understand the difference between clustering and classification in ML
Hello,

I started studying machine learning and its basics. One of the applications of supervised ML is classification, i.e. identifying different objects and classifying them. Being supervised means that the ML algorithm is initially given labelled data. The labelled data is used to a) train the model and b) validate the model, i.e. check how good the model is, since we know the correct answers a priori...

In unsupervised ML, the main goal is clustering and finding patterns. But isn't clustering the same thing as classification? Both are about grouping entities into different sets based on their similar characteristics. Unsupervised ML, however, does not receive labelled data to be trained on. I still think that an unsupervised ML model must be trained and also evaluated. How can the model be evaluated if we don't provide it with the correct answers to check against, which sounds the same as labelled data? Is my understanding correct?
 
  • #2
The term "clustering" implies that there are no pre-defined classes. The ML has complete freedom to define groups that best fit the clusters that it sees in the data. The term "classification" implies that there are a pre-defined set of group definitions that the ML should fit the data into.
 
  • #3
Yes, classification is done with known data that is labeled. You split it into two groups: one for training and one for testing. I think a common split is 80% training and 20% testing. You retrain and retest multiple times to get the accuracy up. Accuracy is determined by how often the ML matches the data label during the tests.
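For concreteness, a minimal sketch of that split-train-test loop in Python, assuming scikit-learn; the iris data and the 5-neighbor classifier are just illustrative choices, not anything specific to this thread:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # labelled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)                # train on labelled examples

# Accuracy: how often predictions match the held-out labels.
print(accuracy_score(y_test, model.predict(X_test)))
```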

In KNN clustering, for example, you construct a distance metric and the ML algorithm discovers which items are close together. The ML doesn't know why they are clustered, only that the metric says they are similar. The most common example would be grouping houses for sale, where the distance metric might include price, sqft, proximity to schools, ... The metric groups houses together by their closeness in the metric. A buyer interested in one house can then look at other similar houses.

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
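A hedged sketch of the house-similarity idea, assuming scikit-learn; the listings and their attribute values are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# columns: price, sqft, distance to nearest school (miles)
houses = np.array([
    [300_000, 1500, 0.5],
    [310_000, 1550, 0.6],
    [750_000, 3200, 2.0],
    [740_000, 3100, 1.8],
])

# Scale features so price doesn't dominate the Euclidean metric.
X = StandardScaler().fit_transform(houses)

nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X[:1])   # houses most similar to the first one
print(idx)                      # -> the house itself plus its closest match
```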
 
  • #4
jedishrfu said:
Yes, classification is done with known data that is labeled. You split it into two groups: one for training and one for testing. I think a common split is 80% training and 20% testing. You retrain and retest multiple times to get the accuracy up. Accuracy is determined by how often the ML matches the data label during the tests.

In KNN clustering, for example, you construct a distance metric and the ML algorithm discovers which items are close together. The ML doesn't know why they are clustered, only that the metric says they are similar. The most common example would be grouping houses for sale, where the distance metric might include price, sqft, proximity to schools, ... The metric groups houses together by their closeness in the metric. A buyer interested in one house can then look at other similar houses.

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
Thank you. I was just reading about KNN, which is a supervised machine learning model. All supervised models, as you and FactChecker mentioned, receive labeled input to use for training, testing, and learning. In supervised ML, the labelled data can be viewed as training the model with flash cards, i.e. a set of questions with the associated correct answers, from which the model learns so it is prepared to face unknown new data and make correct predictions or classifications.

For clustering, which is the goal of unsupervised ML models, the model does not start with labelled data. Still, if I understand correctly, we must also split the initial unlabeled data into training and testing sets for unsupervised models. But how do we check if the unsupervised ML model is doing a good job if we don't know what the correct answers are a priori? How do we know that the clusters it forms are correct without anything to compare against? Thanks for the patience.
 
  • #5
KNN can be used in either classification or clustering. In the case of clustering, the metric groups things and you research the group to decide what it represents.

In a banking application, KNN might create several groups of customers, and by looking at how each group differs from the whole set of customers via attributes like money in the bank, checking, savings, age, marital status, ... you can characterize each group.

One group might stand out: you notice its members are married with kids going to college, and so you decide how to market banking products to them based on that info.

You have to be careful, though, as individual members of the group may not have all the attributes of the group. As an example, they may not have kids, and so you wouldn't want to market a banking product focused on kids, like kids' insurance.
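A sketch of the banking example under the same caveat, assuming scikit-learn and pandas; the customer attributes and values are invented, and k-means stands in for whatever clustering method is actually used:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "balance":  [500, 700, 90_000, 85_000, 40_000, 42_000],
    "age":      [22, 25, 55, 60, 45, 47],
    "num_kids": [0, 0, 0, 1, 2, 3],
})

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(df))

# Profile each cluster: the per-cluster means hint at what a group
# represents (e.g., "middle-aged, high balance"), but individual
# members need not share every attribute of their group.
print(df.groupby(labels).mean())
```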
 
  • #6
Many thanks, jedishrfu.

One more question: linear or logistic regression is an application of supervised ML where the goal is predicting a numerical label (or a class probability, in the logistic case) once the model is trained using known examples (training set) and tested with another set of known examples (validation set).
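Something like this minimal scikit-learn sketch, with synthetic data standing in for the known examples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=1.0, size=200)  # known answers

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

model = LinearRegression().fit(X_train, y_train)  # training set
print(model.score(X_val, y_val))                  # validation set (R^2)
```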

But isn't regression a common topic in statistics? So what is special about doing regression using machine learning? Is doing regression with ML any different? Is it just glorified statistics?

Thanks
 
  • #7
fog37 said:
But isn't regression a common topic in statistics? So what is special about doing regression using machine learning? Is doing regression with ML any different? Is it just glorified statistics?
The subjects are different enough that one can be an expert in one and know very little about the other. In general, ML allows much more flexibility in the modeling, but in doing so it loses the use of many powerful theorems that are available in classical statistics.
 
  • #8
This is what happens in many fields: as they mature, they begin to assimilate other topics. Initially there was AI (trying to mimic intelligence), which branched off into blue-sky stuff with Lisp, expert systems in Prolog (a topic expert's knowledge coded into Prolog rules), and neural nets (reminiscent of the Ghostbusters scene where a student gets a shock for each wrong answer, except here the "shock" goes to the neural net when it gets things wrong) that learn from data.

Later came data mining, and now machine learning is giving way to deep learning as we add more and more stages to prep and reformat our data. Through all these transformations you can see a strong thread of statistics.

Later, folks realized that finding the best-fit line through some data was a form of learning (we train to find the slope and offset, then use it to predict, and later still improve the prediction via continuous training), and so statistical methods got incorporated into ML and became known as statistical learning. The ML field is still evolving, developing all the various tricks of the trade that can be brought to bear on any data-driven learning project.

Success often depends on how well you can reduce the attributes you measure and how well you can tune your models.

There's one really good book on all this:

Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron, which I highly recommend and which can really get you up to speed on ML in Python. Much of today's ML is selecting the best strategy for the given problem, using other people's implementations of algorithms, and then evaluating how well your model performs in the real world.

https://www.amazon.com/dp/1492032646/?tag=pfamazon01-20
 
  • #9
fog37 said:
But isn't regression a common topic in statistics? So what is special about doing regression using machine learning? Is doing regression with ML any different? Is it just glorified statistics?

Yes, both involve model fitting, and there is no sharp line between them. ML tends to have more variables, and the optimization is more often non-convex. It is very helpful to think of ML as the same subject as statistics. Analogously, the ML field of reinforcement learning is closely related to the traditional discipline of control theory.
 
  • #10
jedishrfu said:
This is what happens in many fields: as they mature, they begin to assimilate other topics. Initially there was AI (trying to mimic intelligence), which branched off into blue-sky stuff with Lisp, expert systems in Prolog (a topic expert's knowledge coded into Prolog rules), and neural nets (reminiscent of the Ghostbusters scene where a student gets a shock for each wrong answer, except here the "shock" goes to the neural net when it gets things wrong) that learn from data.

Later came data mining, and now machine learning is giving way to deep learning as we add more and more stages to prep and reformat our data. Through all these transformations you can see a strong thread of statistics.

Later, folks realized that finding the best-fit line through some data was a form of learning (we train to find the slope and offset, then use it to predict, and later still improve the prediction via continuous training), and so statistical methods got incorporated into ML and became known as statistical learning. The ML field is still evolving, developing all the various tricks of the trade that can be brought to bear on any data-driven learning project.

Success often depends on how well you can reduce the attributes you measure and how well you can tune your models.

There's one really good book on all this:

Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron, which I highly recommend and which can really get you up to speed on ML in Python. Much of today's ML is selecting the best strategy for the given problem, using other people's implementations of algorithms, and then evaluating how well your model performs in the real world.

https://www.amazon.com/dp/1492032646/?tag=pfamazon01-20

Thank you jedishrfu. I appreciate how you tied things together.

For instance, deep learning is often presented as a concept distinct from ML. However, I think they are intimately related, since deep learning is essentially neural networks with many hidden layers. Also, deep learning models seem to require less structured input data, while classical ML models (supervised, unsupervised, reinforcement learning) expect the input data to be clean, well organized and structured.

Since you mention Keras and TensorFlow, I read that Keras operates on top of TensorFlow. What does that really mean? Aren't they both ML libraries with methods that can be used in coding? Scikit-learn is another common ML library. How are they different? Maybe Keras and TensorFlow are specific to neural networks?
Keras is described as an API (application programming interface), but I am not sure what that means and whether it is a synonym for library...
 
  • #11
fog37 said:
How do we know that the clusters it forms are correct without anything to compare against? Thanks for the patience.
There is an objective function which you are trying to minimize or maximize. I guess the name is pretty self-explanatory: you have some objective, and you define a function which tells you how well a candidate solution meets that objective.

In classification, the objective is to guess the labels well, and there are different objective functions in use. In dimensionality reduction and manifold learning, the objective is to find a lower-dimensional space where the relative structure of the data is preserved. In clustering, there are different methods with different objectives. Often these methods were developed by first thinking about what they should achieve. People will say something like "the clusters should be low in number, and the variance should be small in each cluster", or whatever they think it means to be a good clustering for whatever purpose they have for it. Then they try to find an objective function which captures that intuition, perhaps adding terms to the function that reward some properties and penalize others.
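For example, here is a minimal sketch of one such objective, the within-cluster sum of squares (this is what k-means minimizes, exposed as `inertia_` in scikit-learn); NumPy assumed, toy points invented:

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances of points to their cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total  # lower is "better" under this objective

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
print(within_cluster_ss(X, np.array([0, 0, 1, 1])))  # tight clusters: 1.0
print(within_cluster_ss(X, np.array([0, 1, 0, 1])))  # mixed-up clusters: 200.0
```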

In some cases, you have multiple objectives that are captured by different functions. In those cases, there may be ambiguity about what an optimal solution is, so there is a notion called Pareto optimality: a Pareto optimal solution is one where you cannot do better in one objective without sacrificing in another. https://en.wikipedia.org/wiki/Pareto_efficiency
 
  • #12
Working backward: an API is short for application programming interface. Keras has an API that calls TensorFlow functions to simplify some of the work of making an ML application.

Sometimes programmers notice that a given product like TF requires the same template of function calls and related glue code (like a recipe) to get something done, and so they implement it as a function itself. This function and other related ones become the basis for the Keras API.
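A hedged sketch of what that layering looks like in code; the tiny model is arbitrary, chosen only to show that a few high-level Keras calls replace the lower-level TensorFlow plumbing they wrap:

```python
import tensorflow as tf

# Two Keras calls define a small network; Keras builds the TF ops for us.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# compile() hides the TF loss/optimizer wiring; fit() hides the training loop.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10)  # would run the wrapped TF ops
```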

Here's an article on ML and DL and their differences. Basically, a DL model is composed of smaller ML stages that extract features from your data, which are fed into later stages of the model before it makes a decision.

https://www.zendesk.com/blog/machine-learning-and-deep-learning/
 
  • #13
jedishrfu said:
Here's an article on ML and DL and their differences. Basically, a DL model is composed of smaller ML stages that extract features from your data, which are fed into later stages of the model before it makes a decision.
I guess that is a good way to think of it. My understanding has been through the concept of the length of the credit assignment path. The surprising thing from a theoretical perspective is that one stage could theoretically provide an equivalent shallow model, yet depth is apparently necessary for learning. It is thought to be similar to how humans learn: we use hierarchical generalizations and abstractions, and we associate properties with them at each level, so we can traverse, so to speak, to the appropriate level quickly with as little information and processing as possible. Another approach to explaining it is with information theory and the information bottleneck principle. It hypothesizes that deep learning works by compressing the information to squeeze it through the layers while maximizing the mutual information between the input and the objective. By doing this, the model learns those stages of features, as you say, which may encode hierarchical properties that support an efficient, simple path to the solutions.

Surprisingly, deep learning (why depth is necessary) is still not explained robustly by theory. And the training process of neural networks, when modeled as a dynamical system, has been shown to exhibit chaos.
 
  • #14
Yeah, a lot of ML and DL reduces to statistical interpretation, meaning we may never understand why it works so well and why it can fail so dramatically.
 
  • #15
Hello again,

In this discussion about clustering, which falls into the field of unsupervised learning (since the training data has no labels, i.e. we don't know the "ground truth"), how does the provided training data actually train a clustering algorithm like KNN?

I believe the programmer can initially set:
a) the number of desired clusters
b) which features will be used to form those clusters

For example, in the case of a group of different flowers, each having 7 different attributes (color, smell, weight, petal length, sepal length, etc.), we would probably use ALL the available attributes to generate the "best" clusters, OR we could select only certain attributes to form the clusters and the separation between the flowers.
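A sketch of choices a) and b) in scikit-learn, using the classic iris data (four numeric attributes rather than seven) as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)   # labels ignored: unsupervised

# a) number of desired clusters; b) which features to cluster on
km_all = KMeans(n_clusters=3, n_init=10).fit(X)          # all 4 attributes
km_two = KMeans(n_clusters=3, n_init=10).fit(X[:, 2:4])  # petal length/width only
print(km_all.inertia_, km_two.inertia_)  # within-cluster sum of squares
```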

How do we even know that those clusters are good? There is no benchmark for performance or correctness in unsupervised learning... How is the "training data" really training the algorithm in the right direction and making it better and better?

Is training data, in the case of unsupervised learning models, data for which we know a priori which correct clusters exist?

All ML models, including unsupervised and reinforcement learning models, need to be trained before being deployed...
 
  • #16
In deep neural networks, it is difficult or impossible to know what is going on in the intermediate levels that lead to the final classification. The mathematical methods are known, but the training results are hard to interpret.
 
  • #17
fog37 said:
How do we even know that those clusters are good? There is no benchmark for performance or correctness in unsupervised learning... How is the "training data" really training the algorithm in the right direction and making it better and better?

Is training data, in the case of unsupervised learning models, data for which we know a priori which correct clusters exist?

All ML models, including unsupervised and reinforcement learning models, need to be trained before being deployed...
The benchmark for performance is the objective function.

Essentially, you as the developer of the algorithm decide what is 'correct', so to speak, by creating a function that takes the clustering result as input and outputs a higher or lower value indicating how 'good' the result is. You then optimize the clustering so that the objective function is minimized or maximized.

The training data in this case would just be the data you already have. The ground truth would be whatever result you would converge to if you had a large enough and representative enough sample to minimize the objective function over. The more data you have, the closer you can get to the ground truth in that sense.

There might be multiple optimal solutions as well.
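One widely used label-free score of this kind is the silhouette score, which rewards tight, well-separated clusters. A sketch, again with iris as a stand-in; comparing the score across candidate values of k is one common way to pick a clustering without any labels:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher = better-separated clusters
```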
 
  • #18
jarvis323,

Thanks. To make sure I understand: since the training data is not labelled, the developer must subjectively define an objective function to create a way to assess the performance of the unsupervised model during the training phase.

For example, in recommendation systems used by companies like Amazon, the clustering of customers obtained via unsupervised learning can be based on customers' types of purchases, age, location, etc. It seems difficult to judge whether that clustering is "correct" or "incorrect". How would a mathematical objective function measure the correctness?
Say, for example, that a customer is placed in cluster B instead of A... I guess we could compare those results with the history of other similar customers, but that sounds a little like training with labelled data, since we are relying on historical data. I am rambling at this point :)
 
  • #19
fog37 said:
jarvis323,

Thanks. To make sure I understand: since the training data is not labelled, the developer must subjectively define an objective function to create a way to assess the performance of the unsupervised model during the training phase.

For example, in recommendation systems used by companies like Amazon, the clustering of customers obtained via unsupervised learning can be based on customers' types of purchases, age, location, etc. It seems difficult to judge whether that clustering is "correct" or "incorrect". How would a mathematical objective function measure the correctness?
Say, for example, that a customer is placed in cluster B instead of A... I guess we could compare those results with the history of other similar customers, but that sounds a little like training with labelled data, since we are relying on historical data. I am rambling at this point :)
I think that ultimately, for those companies, the objective function is how much money they make. Theoretically, maybe they could have swindled you out of everything you have, so the ground truth is not clear. But they will be optimizing towards making more. Whether you should have been in cluster B instead of A depends on whether you bought the thing they advertised to you, for example. They have lots of purchase histories and know lots of things about people, so they actually have a lot of labeled training data. In some cases, they just go with what has proven to work. Sometimes people just do lots of trial and error, and eventually they figure out a good objective function that makes them a lot of money.

With the Netflix recommendation competition, the goal was to predict a person's movie rating, and they had lots of customer ratings to use as labeled data. SVD, of all things, ended up being the backbone of the most successful approach. But notice they also tack some extra terms onto the objective function that are not just based on the labels.

https://towardsdatascience.com/reco...-decomposition-svd-truncated-svd-97096338f361
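A toy sketch of the SVD idea for ratings, assuming scikit-learn: factor a (users x movies) matrix into low-rank pieces and use their product to fill in unobserved ratings. The tiny matrix is invented, and real systems treat missing entries more carefully (e.g., optimizing only over observed ratings):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)  # 0 = unrated

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(ratings)        # users in latent space
reconstructed = user_factors @ svd.components_   # predicted ratings
print(np.round(reconstructed, 1))
```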

Even with supervised learning, it is somewhat common to learn the features and patterns in an unsupervised way, maybe implicitly, which is essentially what happens in a neural network.

If you have trained a model to classify digits, then you have also trained the model to learn the 'structural similarities' between images. In fact, this is how the model is able to classify digits in the first place: by learning the features of each digit.

If it seems that this process is ‘hidden’ from you, it’s because it is. Latent, by definition, means “hidden.”

https://towardsdatascience.com/unde...machine-learning-de5a7c687d8d?gi=50877eafb53c
 
  • #20
fog37 said:
Since you mention Keras and TensorFlow, I read that Keras operates on top of TensorFlow. What does that really mean? Aren't they both ML libraries with methods that can be used in coding? Scikit-learn is another common ML library. How are they different? Maybe Keras and TensorFlow are specific to neural networks?
Keras is described as an API (application programming interface), but I am not sure what that means and whether it is a synonym for library...
Was this not fully addressed in this question of yours 3 weeks ago?
 
  • #21
Hello pbuk, sorry if that is the case. I will check...I must have spaced out.
 
  • #22
fog37 said:
Thank you. I was just reading about KNN, which is a supervised machine learning model. All supervised models, as you and FactChecker mentioned, receive labeled input to use for training, testing, and learning. In supervised ML, the labelled data can be viewed as training the model with flash cards, i.e. a set of questions with the associated correct answers, from which the model learns so it is prepared to face unknown new data and make correct predictions or classifications.

For clustering, which is the goal of unsupervised ML models, the model does not start with labelled data. Still, if I understand correctly, we must also split the initial unlabeled data into training and testing sets for unsupervised models. But how do we check if the unsupervised ML model is doing a good job if we don't know what the correct answers are a priori? How do we know that the clusters it forms are correct without anything to compare against? Thanks for the patience.
You use a loss function and choose a threshold to decide whether you are close enough.
 
  • #23
FactChecker said:
The subjects are different enough that one can be an expert in one and know very little about the other. In general, ML allows much more flexibility in the modeling, but in doing so it loses the use of many powerful theorems that are available in classical statistics.
As I understand it, standard statistics gives priority to accurate estimation of population parameters, while in ML you're more interested in prediction. I'm not sure how this makes a difference. By tuning parameters, I assume we mean changing the values of the model's parameters (the weights feeding the activation functions) to minimize the loss function. I believe this is done using gradient descent, among other methods.
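A toy sketch of that loop, tuning a slope and offset by gradient descent on a squared-error loss (NumPy assumed, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=100)  # true slope 2.0, offset 0.5

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = (w * x + b) - y            # residuals under current parameters
    w -= lr * 2 * (err * x).mean()   # dLoss/dw for mean squared error
    b -= lr * 2 * err.mean()         # dLoss/db
print(w, b)                          # converges near 2.0 and 0.5
```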
 
  • #24
WWGD said:
As I understand it, standard statistics gives priority to accurate estimation of population parameters, while in ML you're more interested in prediction. I'm not sure how this makes a difference. By tuning parameters, I assume we mean changing the values of the model's parameters (the weights feeding the activation functions) to minimize the loss function. I believe this is done using gradient descent, among other methods.
Normally, statistics will try to determine probabilities for the answers that it gives. That is done using powerful statistical theorems. ML often has complicated non-linear and even discontinuous steps in its process that prevent the application of those theorems.
 
  • #25
There are efforts to estimate errors in machine learning.

http://proceedings.mlr.press/v48/gal16.html
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Yarin Gal, Zoubin Ghahramani

https://arxiv.org/abs/1708.08843
Uncertainties in Parameters Estimated with Neural Networks: Application to Strong Gravitational Lensing
Laurence Perreault Levasseur, Yashar D. Hezaveh, Risa H. Wechsler

https://arxiv.org/abs/2007.04176
Detection of Gravitational Waves Using Bayesian Neural Networks
Yu-Chiung Lin, Jiun-Huei Proty Wu
 
  • #26
Certainly, statistical methods can be applied to the results of ML, just as they can be applied to many other things. Statistical analysis and ML are not mutually exclusive. But ML is not a subset of statistical analysis. In fact, many ML techniques make it very hard, or impractical, to carry statistical theory through the algorithms. I suppose there are exceptions.
 
  • #27
FactChecker said:
Certainly, statistical methods can be applied to the results of ML, just as they can be applied to many other things. Statistical analysis and ML are not mutually exclusive. But ML is not a subset of statistical analysis. In fact, many ML techniques make it very hard, or impractical, to carry statistical theory through the algorithms. I suppose there are exceptions.
Yes, I wasn't disagreeing with what you said, just bringing up some efforts that try to make things better. I think they are largely still research directions rather than plug and play.
 
  • #28
atyy said:
Yes, I wasn't disagreeing with what you said, just bringing up some efforts that try to make things better. I think they are largely still research directions rather than plug and play.
Yes. I thought your links were very informative on that subject. It is worth studying.
 
  • #29
Jarvis323 said:
The benchmark for performance is the objective function.

Essentially, you as the developer of the algorithm decide what is 'correct', so to speak, by creating a function that takes the clustering result as input and outputs a higher or lower value indicating how 'good' the result is. You then optimize the clustering so that the objective function is minimized or maximized.

The training data in this case would just be the data you already have. The ground truth would be whatever result you would converge to if you had a large enough and representative enough sample to minimize the objective function over. The more data you have, the closer you can get to the ground truth in that sense.

There might be multiple optimal solutions as well.
I am assuming here some type of ANOVA would work: there should be little variability within classes, less than the variability between classes. I remember reading articles using this perspective to classify dog breeds and to test classification schemes like the one used for generations: "Baby Boomer", "Gen X", "Millennial", etc. Is this scheme reasonable/helpful? I can't remember where I read it, though.
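A sketch of that ANOVA-style check on invented 1-D data: compare between-class to within-class variability; a large ratio suggests well-separated classes (NumPy assumed):

```python
import numpy as np

def between_within_ratio(x, labels):
    """Ratio of between-class to within-class sum of squares."""
    grand = x.mean()
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        grp = x[labels == c]
        within += ((grp - grp.mean()) ** 2).sum()
        between += len(grp) * (grp.mean() - grand) ** 2
    return between / within

x = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])
labels = np.array([0, 0, 0, 1, 1, 1])
print(between_within_ratio(x, labels))  # large: classes well separated
```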
 
  • #30
WWGD said:
I am assuming here some type of ANOVA would work: there should be little variability within classes, less than the variability between classes. I remember reading articles using this perspective to classify dog breeds and to test classification schemes like the one used for generations: "Baby Boomer", "Gen X", "Millennial", etc. Is this scheme reasonable/helpful? I can't remember where I read it, though.
I think it's reasonable. But it also depends on the nature of the data, the associations between features and labels, and the statistical assumptions you can make.
 

1. What is the difference between clustering and classification in machine learning?

Clustering is an unsupervised learning technique where data is grouped into clusters based on their similarities, while classification is a supervised learning technique where data is labeled and grouped into predefined categories.

2. How do clustering and classification algorithms work?

Clustering algorithms use mathematical techniques to group data points based on their similarities, while classification algorithms use statistical methods to classify data into predefined categories based on their features.

3. What is the purpose of using clustering and classification in machine learning?

The purpose of using clustering and classification in machine learning is to organize and make sense of large amounts of data, identify patterns and relationships, and make predictions or decisions based on the data.

4. What are some common clustering and classification algorithms used in machine learning?

Some common clustering algorithms include k-means, hierarchical clustering, and DBSCAN. Common classification algorithms include decision trees, logistic regression, and support vector machines.
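A minimal side-by-side sketch of the two settings on the same data, assuming scikit-learn: k-means ignores the labels, while the decision tree requires them.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # unsupervised: no y
classifier = DecisionTreeClassifier().fit(X, y)            # supervised: needs y
print(clusters[:5], classifier.predict(X[:5]))
```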

5. What are the advantages and disadvantages of using clustering and classification in machine learning?

The advantages of using clustering and classification in machine learning include the ability to handle large datasets, identify patterns and relationships in the data, and make predictions or decisions based on the data. However, the main disadvantage is that the accuracy of the results depends heavily on the quality of the data and the chosen algorithm.
