Clustering and Classification in Machine Learning

In summary, supervised learning models are used to identify and classify objects using labelled examples, while unsupervised learning models are used to find patterns in data without pre-defined answers.
  • #1
fog37
TL;DR Summary
Understand the difference between clustering and classification in ML
Hello,

I started studying machine learning and its basics. One of the applications of supervised ML is classification, i.e. identifying different objects and classifying them. Being supervised means that the ML algorithm is initially given labelled data. The labelled data is used to a) train the model and b) validate the model, i.e. check how good the model is, since we know the correct answers a priori...

In unsupervised ML, the main goal is clustering and finding patterns. But isn't clustering the same thing as classification? Both are about grouping entities into different sets based on their similar characteristics. Unsupervised ML, however, does not receive labelled data to be trained on. I still think that an unsupervised ML model must be trained and also evaluated. How can the model be evaluated if we don't provide it with the correct answers to check against, which sounds the same as labelled data? Is my understanding correct?
 
  • #2
The term "clustering" implies that there are no pre-defined classes. The ML has complete freedom to define groups that best fit the clusters that it sees in the data. The term "classification" implies that there are a pre-defined set of group definitions that the ML should fit the data into.
 
  • #3
Yes, classification is done with known data that is labeled. You split it into two groups: one for training and one for testing. I think a common split is 80% training and 20% testing. You retrain and retest multiple times to get the accuracy up. Accuracy is determined by how often the ML matches the data label during the tests.
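For concreteness, a minimal sketch of that split-train-test loop in Python, assuming scikit-learn; the iris data and the 5-neighbor classifier are just illustrative choices, not anything specific to this thread:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # labelled data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)                # train on labelled examples

# Accuracy: how often predictions match the held-out labels.
print(accuracy_score(y_test, model.predict(X_test)))
```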

In KNN clustering, for example, you construct a distance metric and the ML algorithm discovers which items are close together. The ML doesn't know why they are clustered, only that the metric says they are similar. The most common example would be grouping houses for sale, where the distance metric might include price, sqft, proximity to schools, ... The metric groups houses together by their closeness in the metric. A buyer interested in one house can then look at other similar houses.

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
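A hedged sketch of the house-similarity idea, assuming scikit-learn; the listings and their attribute values are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# columns: price, sqft, distance to nearest school (miles)
houses = np.array([
    [300_000, 1500, 0.5],
    [310_000, 1550, 0.6],
    [750_000, 3200, 2.0],
    [740_000, 3100, 1.8],
])

# Scale features so price doesn't dominate the Euclidean metric.
X = StandardScaler().fit_transform(houses)

nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X[:1])   # houses most similar to the first one
print(idx)                      # -> the house itself plus its closest match
```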
 
  • #4
jedishrfu said:
Yes, classification is done with known data that is labeled. You split it into two groups: one for training and one for testing. I think a common split is 80% training and 20% testing. You retrain and retest multiple times to get the accuracy up. Accuracy is determined by how often the ML matches the data label during the tests.

In KNN clustering, for example, you construct a distance metric and the ML algorithm discovers which items are close together. The ML doesn't know why they are clustered, only that the metric says they are similar. The most common example would be grouping houses for sale, where the distance metric might include price, sqft, proximity to schools, ... The metric groups houses together by their closeness in the metric. A buyer interested in one house can then look at other similar houses.

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
Thank you. I was just reading about KNN, which is a supervised machine learning model. All supervised models, as you and FactChecker mentioned, receive labeled input to use for training, testing, and learning. In supervised ML, the labelled data can be viewed as training the model with flash cards, i.e. a set of questions with the associated correct answers, from which the model learns so it is prepared to face unknown new data and make correct predictions or classifications.

For clustering, which is the goal of unsupervised ML models, the model does not start with labelled data. Still, if I understand correctly, we must also split the initial unlabeled data into training and testing sets for unsupervised models. But how do we check if the unsupervised ML model is doing a good job if we don't know what the correct answers are a priori? How do we know that the clusters it forms are correct without anything to compare against? Thanks for the patience.
 
  • #5
KNN can be used in either classification or clustering. In the case of clustering, the metric groups things and you research the group to decide what it represents.

In a banking application, KNN might create several groups of customers, and by looking at how each group differs from the whole set of customers via attributes like money in the bank, checking, savings, age, marital status, ... you can characterize each group.

One group might stand out: you notice its members are married with kids going to college, and so you decide how to market banking products to them based on that info.

You have to be careful, though, as individual members of the group may not have all the attributes of the group. As an example, they may not have kids, and so you wouldn't want to market a banking product focused on kids, like kids' insurance.
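A sketch of the banking example under the same caveat, assuming scikit-learn and pandas; the customer attributes and values are invented, and k-means stands in for whatever clustering method is actually used:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "balance":  [500, 700, 90_000, 85_000, 40_000, 42_000],
    "age":      [22, 25, 55, 60, 45, 47],
    "num_kids": [0, 0, 0, 1, 2, 3],
})

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(df))

# Profile each cluster: the per-cluster means hint at what a group
# represents (e.g., "middle-aged, high balance"), but individual
# members need not share every attribute of their group.
print(df.groupby(labels).mean())
```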
 
  • #6
Many thanks, jedishrfu.

One more question: linear or logistic regression is an application of supervised ML where the goal is predicting a numerical label (or a class probability, in the logistic case) once the model is trained using known examples (training set) and tested with another set of known examples (validation set).
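Something like this minimal scikit-learn sketch, with synthetic data standing in for the known examples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=1.0, size=200)  # known answers

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

model = LinearRegression().fit(X_train, y_train)  # training set
print(model.score(X_val, y_val))                  # validation set (R^2)
```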

But isn't regression a common topic in statistics? So what is special about doing regression using machine learning? Is doing regression with ML any different? Is it just glorified statistics?

Thanks
 
  • #7
fog37 said:
But isn't regression a common topic in statistics? So what is special about doing regression using machine learning? Is doing regression with ML any different? Is it just glorified statistics?
The subjects are different enough that one can be an expert in one and know very little about the other. In general, ML allows much more flexibility in the modeling, but in doing so it loses the use of many powerful theorems that are available in classical statistics.
 
  • #8
This is what happens in many fields: as they mature, they begin to assimilate other topics. Initially there was AI (trying to mimic intelligence), which branched off into blue-sky stuff with Lisp, expert systems in Prolog (a topic expert's knowledge coded into Prolog rules), and neural nets (reminiscent of the Ghostbusters scene where a student gets a shock for each wrong answer, except here the "shock" goes to the neural net when it gets things wrong) that learn from data.

Later came data mining, and now machine learning is giving way to deep learning as we add more and more stages to prep and reformat our data. Through all these transformations you can see a strong thread of statistics.

Later, folks realized that finding the best-fit line through some data was a form of learning (we train to find the slope and offset, then use it to predict, and later still improve the prediction via continuous training), and so statistical methods got incorporated into ML and became known as statistical learning. The ML field is still evolving, developing all the various tricks of the trade that can be brought to bear on any data-driven learning project.

Success often depends on how well you can reduce the attributes you measure and how well you can tune your models.

There's one really good book on all this:

Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron, which I highly recommend and which can really get you up to speed on ML in Python. Much of today's ML is selecting the best strategy for the given problem, using other people's implementations of algorithms, and then evaluating how well your model performs in the real world.

https://www.amazon.com/dp/1492032646/?tag=pfamazon01-20
 
  • #9
fog37 said:
But isn't regression a common topic in statistics? So what is special about doing regression using machine learning? Is doing regression with ML any different? Is it just glorified statistics?

Yes, both involve model fitting, and there is no sharp line between them. ML tends to have more variables, and the optimization is more often non-convex. It is very helpful to think of ML as the same subject as statistics. Analogously, the ML field of reinforcement learning is closely related to the traditional discipline of control theory.
 
  • #10
jedishrfu said:
This is what happens in many fields: as they mature, they begin to assimilate other topics. Initially there was AI (trying to mimic intelligence), which branched off into blue-sky stuff with Lisp, expert systems in Prolog (a topic expert's knowledge coded into Prolog rules), and neural nets (reminiscent of the Ghostbusters scene where a student gets a shock for each wrong answer, except here the "shock" goes to the neural net when it gets things wrong) that learn from data.

Later came data mining, and now machine learning is giving way to deep learning as we add more and more stages to prep and reformat our data. Through all these transformations you can see a strong thread of statistics.

Later, folks realized that finding the best-fit line through some data was a form of learning (we train to find the slope and offset, then use it to predict, and later still improve the prediction via continuous training), and so statistical methods got incorporated into ML and became known as statistical learning. The ML field is still evolving, developing all the various tricks of the trade that can be brought to bear on any data-driven learning project.

Success often depends on how well you can reduce the attributes you measure and how well you can tune your models.

There's one really good book on all this:

Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron, which I highly recommend and which can really get you up to speed on ML in Python. Much of today's ML is selecting the best strategy for the given problem, using other people's implementations of algorithms, and then evaluating how well your model performs in the real world.

https://www.amazon.com/dp/1492032646/?tag=pfamazon01-20

Thank you jedishrfu. I appreciate how you tied things together.

For instance, deep learning is often presented as a concept distinct from ML. However, I think they are intimately related, since deep learning is essentially neural networks with many hidden layers. Also, deep learning models seem to require less structured input data, while classical ML models (supervised, unsupervised, reinforcement learning) expect the input data to be clean, well organized and structured.

Since you mention Keras and TensorFlow, I read that Keras operates on top of TensorFlow. What does that really mean? Aren't they both ML libraries with methods that can be used in coding? Scikit-learn is another common ML library. How are they different? Maybe Keras and TensorFlow are specific to neural networks?
Keras is described as an API (application programming interface), but I am not sure what that means and whether it is a synonym for library...
 
  • #11
fog37 said:
How do we know that the clusters it forms are correct without anything to compare against? Thanks for the patience.
There is an objective function which you are trying to minimize or maximize. I guess the name is pretty self-explanatory: you have some objective, and you define a function which tells you how well a candidate solution meets that objective.

In classification, the objective is to guess the labels well, and there are different objective functions in use. In dimensionality reduction and manifold learning, the objective is to find a lower-dimensional space where the relative structure of the data is preserved. In clustering, there are different methods with different objectives. Often these methods were developed by first thinking about what they should achieve. People will say something like "the clusters should be low in number, and the variance should be small in each cluster", or whatever they think it means to be a good clustering for whatever purpose they have for it. Then they try to find an objective function which captures that intuition, perhaps adding terms to the function that reward some properties and penalize others.
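For example, here is a minimal sketch of one such objective, the within-cluster sum of squares (this is what k-means minimizes, exposed as `inertia_` in scikit-learn); NumPy assumed, toy points invented:

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances of points to their cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total  # lower is "better" under this objective

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
print(within_cluster_ss(X, np.array([0, 0, 1, 1])))  # tight clusters: 1.0
print(within_cluster_ss(X, np.array([0, 1, 0, 1])))  # mixed-up clusters: 200.0
```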

In some cases, you have multiple objectives that are captured by different functions. In those cases, there may be ambiguity about what an optimal solution is, so there is a notion called Pareto optimality: a Pareto optimal solution is one where you cannot do better in one objective without sacrificing in another. https://en.wikipedia.org/wiki/Pareto_efficiency
 
  • #12
Working backward: an API is short for application programming interface. Keras has an API that calls TensorFlow functions to simplify some of the work of making an ML application.

Sometimes programmers notice that a given product like TF requires the same template of function calls and related glue code (like a recipe) to get something done, and so they implement it as a function itself. This function and other related ones become the basis for the Keras API.
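A hedged sketch of what that layering looks like in code; the tiny model is arbitrary, chosen only to show that a few high-level Keras calls replace the lower-level TensorFlow plumbing they wrap:

```python
import tensorflow as tf

# Two Keras calls define a small network; Keras builds the TF ops for us.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# compile() hides the TF loss/optimizer wiring; fit() hides the training loop.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10)  # would run the wrapped TF ops
```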

Here's an article on ML and DL and their differences. Basically, a DL model is composed of smaller ML stages that extract features from your data, which are fed into later stages of the model before it makes a decision.

https://www.zendesk.com/blog/machine-learning-and-deep-learning/
 
  • #13
jedishrfu said:
Here's an article on ML and DL and their differences. Basically, a DL model is composed of smaller ML stages that extract features from your data, which are fed into later stages of the model before it makes a decision.
I guess that is a good way to think of it. My understanding has been through the concept of the length of the credit assignment path. The surprising thing from a theoretical perspective is that one stage could theoretically provide an equivalent shallow model, yet depth is apparently necessary for learning. It is thought to be similar to how humans learn: we use hierarchical generalizations and abstractions, and we associate properties with them at each level, so we can traverse, so to speak, to the appropriate level quickly with as little information and processing as possible. Another approach to explaining it is with information theory and the information bottleneck principle. It hypothesizes that deep learning works by compressing the information to squeeze it through the layers while maximizing the mutual information between the input and the objective. By doing this, the model learns those stages of features, as you say, which may encode hierarchical properties that support an efficient, simple path to the solutions.

Surprisingly, deep learning (why depth is necessary) is still not explained robustly by theory. And the training process of neural networks, when modeled as a dynamical system, has been shown to exhibit chaos.
 
  • #14
Yeah, a lot of ML and DL reduces to statistical interpretation, meaning we may never understand why it works so well and why it can fail so dramatically.
 
  • #15
Hello again,

In this discussion about clustering, which falls into the field of unsupervised learning (since the training data has no labels, i.e. we don't know the "ground truth"), how does the provided training data actually train a clustering algorithm like KNN?

I believe the programmer can initially set:
a) the number of desired clusters
b) which features will be used to form those clusters

For example, in the case of a group of different flowers, each having 7 different attributes (color, smell, weight, petal length, sepal length, etc.), we would probably use ALL the available attributes to generate the "best" clusters, OR we could select only certain attributes to form the clusters and the separation between the flowers.
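A sketch of choices a) and b) in scikit-learn, using the classic iris data (four numeric attributes rather than seven) as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)   # labels ignored: unsupervised

# a) number of desired clusters; b) which features to cluster on
km_all = KMeans(n_clusters=3, n_init=10).fit(X)          # all 4 attributes
km_two = KMeans(n_clusters=3, n_init=10).fit(X[:, 2:4])  # petal length/width only
print(km_all.inertia_, km_two.inertia_)  # within-cluster sum of squares
```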

How do we even know that those clusters are good? There is no benchmark for performance or correctness in unsupervised learning... How is the "training data" really training the algorithm in the right direction and making it better and better?

Is training data, in the case of unsupervised learning models, data for which we know a priori which correct clusters exist?

All ML models, including unsupervised and reinforcement learning models, need to be trained before being deployed...
 
  • #16
In deep neural networks, it is difficult or impossible to know what is going on in the intermediate levels that lead to the final classification. The mathematical methods are known, but the training results are hard to interpret.
 
  • #17
fog37 said:
How do we even know that those clusters are good? There is no benchmark for performance or correctness in unsupervised learning... How is the "training data" really training the algorithm in the right direction and making it better and better?

Is training data, in the case of unsupervised learning models, data for which we know a priori which correct clusters exist?

All ML models, including unsupervised and reinforcement learning models, need to be trained before being deployed...
The benchmark for performance is the objective function.

Essentially, you as the developer of the algorithm decide what is 'correct', so to speak, by creating a function that takes the clustering result as input and outputs a higher or lower value indicating how 'good' the result is. You then optimize the clustering so that the objective function is minimized or maximized.

The training data in this case would just be the data you already have. The ground truth would be whatever result you would converge to if you had a large enough and representative enough sample to minimize the objective function over. The more data you have, the closer you can get to the ground truth in that sense.

There might be multiple optimal solutions as well.
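One widely used label-free score of this kind is the silhouette score, which rewards tight, well-separated clusters. A sketch, again with iris as a stand-in; comparing the score across candidate values of k is one common way to pick a clustering without any labels:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher = better-separated clusters
```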
 
  • #18
jarvis323,

Thanks. To make sure I understand: since the training data is not labelled, the developer must subjectively define an objective function to create a way to assess the performance of the unsupervised model during the training phase.

For example, in recommendation systems used by companies like Amazon, the clustering of customers obtained via unsupervised learning can be based on customers' types of purchases, age, location, etc. It seems difficult to judge whether that clustering is "correct" or "incorrect". How would a mathematical objective function measure the correctness?
Say, for example, that a customer is placed in cluster B instead of A... I guess we could compare those results with the history of other similar customers, but that sounds a little like training with labelled data, since we are relying on historical data. I am rambling at this point :)
 
  • #19
fog37 said:
jarvis323,

Thanks. To make sure I understand: since the training data is not labelled, the developer must subjectively define an objective function to create a way to assess the performance of the unsupervised model during the training phase.

For example, in recommendation systems used by companies like Amazon, the clustering of customers obtained via unsupervised learning can be based on customers' types of purchases, age, location, etc. It seems difficult to judge whether that clustering is "correct" or "incorrect". How would a mathematical objective function measure the correctness?
Say, for example, that a customer is placed in cluster B instead of A... I guess we could compare those results with the history of other similar customers, but that sounds a little like training with labelled data, since we are relying on historical data. I am rambling at this point :)
I think that ultimately, for those companies, the objective function is how much money they make. Theoretically, maybe they could have swindled you out of everything you have, so the ground truth is not clear. But they will be optimizing towards making more. Whether you should have been in cluster B instead of A depends on whether you bought the thing they advertised to you, for example. They have lots of purchase histories and know lots of things about people, so they actually have a lot of labeled training data. In some cases, they just go with what has proven to work. Sometimes people just do lots of trial and error, and eventually they figure out a good objective function that makes them a lot of money.

With the Netflix recommendation competition, the goal was to predict a person's movie rating, and they had lots of customer ratings to use as labeled data. SVD, of all things, ended up being the backbone of the most successful approach. But notice they also tack some extra terms onto the objective function that are not just based on the labels.

https://towardsdatascience.com/reco...-decomposition-svd-truncated-svd-97096338f361
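A toy sketch of the SVD idea for ratings, assuming scikit-learn: factor a (users x movies) matrix into low-rank pieces and use their product to fill in unobserved ratings. The tiny matrix is invented, and real systems treat missing entries more carefully (e.g., optimizing only over observed ratings):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)  # 0 = unrated

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(ratings)        # users in latent space
reconstructed = user_factors @ svd.components_   # predicted ratings
print(np.round(reconstructed, 1))
```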

Even with supervised learning, it is somewhat common to learn the features and patterns in an unsupervised way, maybe implicitly, which is essentially what happens in a neural network.

If you have trained a model to classify digits, then you have also trained the model to learn the 'structural similarities' between images. In fact, this is how the model is able to classify digits in the first place: by learning the features of each digit.

If it seems that this process is ‘hidden’ from you, it’s because it is. Latent, by definition, means “hidden.”

https://towardsdatascience.com/unde...machine-learning-de5a7c687d8d?gi=50877eafb53c
 
  • #20
fog37 said:
Since you mention Keras and TensorFlow, I read that Keras operates on top of TensorFlow. What does that really mean? Aren't they both ML libraries with methods that can be used in coding? Scikit-learn is another common ML library. How are they different? Maybe Keras and TensorFlow are specific to neural networks?
Keras is described as an API (application programming interface), but I am not sure what that means and whether it is a synonym for library...
Was this not fully addressed in this question of yours 3 weeks ago?
 
  • #21
Hello pbuk, sorry if that is the case. I will check...I must have spaced out.
 
  • #22
fog37 said:
Thank you. I was just reading about KNN, which is a supervised machine learning model. All supervised models, as you and FactChecker mentioned, receive labeled input to use for training, testing, and learning. In supervised ML, the labelled data can be viewed as training the model with flash cards, i.e. a set of questions with the associated correct answers, from which the model learns so it is prepared to face unknown new data and make correct predictions or classifications.

For clustering, which is the goal of unsupervised ML models, the model does not start with labelled data. Still, if I understand correctly, we must also split the initial unlabeled data into training and testing sets for unsupervised models. But how do we check if the unsupervised ML model is doing a good job if we don't know what the correct answers are a priori? How do we know that the clusters it forms are correct without anything to compare against? Thanks for the patience.
You use a loss function and choose a threshold to decide whether you are close enough.
 
  • #23
FactChecker said:
The subjects are different enough that one can be an expert in one and know very little about the other. In general, ML allows much more flexibility in the modeling, but in doing so it loses the use of many powerful theorems that are available in classical statistics.
As I understand it, standard statistics gives priority to accurate estimation of population parameters, while in ML you're more interested in prediction. I'm not sure how this makes a difference. By tuning parameters, I assume we mean changing the values of the model's parameters (the weights feeding the activation functions) to minimize the loss function. I believe this is done using gradient descent, among other methods.
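A toy sketch of that loop, tuning a slope and offset by gradient descent on a squared-error loss (NumPy assumed, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=100)  # true slope 2.0, offset 0.5

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = (w * x + b) - y            # residuals under current parameters
    w -= lr * 2 * (err * x).mean()   # dLoss/dw for mean squared error
    b -= lr * 2 * err.mean()         # dLoss/db
print(w, b)                          # converges near 2.0 and 0.5
```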
 
  • #24
WWGD said:
As I understand it, standard statistics gives priority to accurate estimation of population parameters, while in ML you're more interested in prediction. I'm not sure how this makes a difference. By tuning parameters, I assume we mean changing the values of the model's parameters (the weights feeding the activation functions) to minimize the loss function. I believe this is done using gradient descent, among other methods.
Normally, statistics will try to determine probabilities for the answers that it gives. That is done using powerful statistical theorems. ML often has complicated non-linear and even discontinuous steps in its process that prevent the application of those theorems.
 
  • #25
There are efforts to estimate errors in machine learning.

http://proceedings.mlr.press/v48/gal16.html
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Yarin Gal, Zoubin Ghahramani

https://arxiv.org/abs/1708.08843
Uncertainties in Parameters Estimated with Neural Networks: Application to Strong Gravitational Lensing
Laurence Perreault Levasseur, Yashar D. Hezaveh, Risa H. Wechsler

https://arxiv.org/abs/2007.04176
Detection of Gravitational Waves Using Bayesian Neural Networks
Yu-Chiung Lin, Jiun-Huei Proty Wu
 
  • #26
Certainly, statistical methods can be applied to the results of ML, just as they can be applied to many other things. Statistical analysis and ML are not mutually exclusive. But ML is not a subset of statistical analysis. In fact, many ML techniques make it very hard, or impractical, to carry statistical theory through the algorithms. I suppose there are exceptions.
 
  • #27
FactChecker said:
Certainly, statistical methods can be applied to the results of ML, just as they can be applied to many other things. Statistical analysis and ML are not mutually exclusive. But ML is not a subset of statistical analysis. In fact, many ML techniques make it very hard, or impractical, to carry statistical theory through the algorithms. I suppose there are exceptions.
Yes, I wasn't disagreeing with what you said, just bringing up some efforts that try to make things better. I think they are largely still research directions rather than plug and play.
 
  • #28
atyy said:
Yes, I wasn't disagreeing with what you said, just bringing up some efforts that try to make things better. I think they are largely still research directions rather than plug and play.
Yes. I thought your links were very informative on that subject. It is worth studying.
 
  • #29
Jarvis323 said:
The benchmark for performance is the objective function.

Essentially, you as the developer of the algorithm decide what is 'correct', so to speak, by creating a function that takes the clustering result as input and outputs a higher or lower value indicating how 'good' the result is. You then optimize the clustering so that the objective function is minimized or maximized.

The training data in this case would just be the data you already have. The ground truth would be whatever result you would converge to if you had a large enough and representative enough sample to minimize the objective function over. The more data you have, the closer you can get to the ground truth in that sense.

There might be multiple optimal solutions as well.
I am assuming here some type of ANOVA would work: there should be little variability within classes, less than the variability between classes. I remember reading articles using this perspective to classify dog breeds and to test classification schemes like the one used for generations: "Baby Boomer", "Gen X", "Millennial", etc. Is this scheme reasonable/helpful? I can't remember where I read it, though.
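A sketch of that ANOVA-style check on invented 1-D data: compare between-class to within-class variability; a large ratio suggests well-separated classes (NumPy assumed):

```python
import numpy as np

def between_within_ratio(x, labels):
    """Ratio of between-class to within-class sum of squares."""
    grand = x.mean()
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        grp = x[labels == c]
        within += ((grp - grp.mean()) ** 2).sum()
        between += len(grp) * (grp.mean() - grand) ** 2
    return between / within

x = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])
labels = np.array([0, 0, 0, 1, 1, 1])
print(between_within_ratio(x, labels))  # large: classes well separated
```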
 
  • #30
WWGD said:
I am assuming here some type of ANOVA would work: there should be little variability within classes, less than the variability between classes. I remember reading articles using this perspective to classify dog breeds and to test classification schemes like the one used for generations: "Baby Boomer", "Gen X", "Millennial", etc. Is this scheme reasonable/helpful? I can't remember where I read it, though.
I think it's reasonable. But it also depends on the nature of the data, the associations between features and labels, and the statistical assumptions you can make.
 

1. What is the difference between clustering and classification in machine learning?

Clustering is an unsupervised learning technique where data is grouped into clusters based on their similarities, while classification is a supervised learning technique where data is labeled and grouped into predefined categories.

2. How do clustering and classification algorithms work?

Clustering algorithms use mathematical techniques to group data points based on their similarities, while classification algorithms use statistical methods to classify data into predefined categories based on their features.

3. What is the purpose of using clustering and classification in machine learning?

The purpose of using clustering and classification in machine learning is to organize and make sense of large amounts of data, identify patterns and relationships, and make predictions or decisions based on the data.

4. What are some common clustering and classification algorithms used in machine learning?

Some common clustering algorithms include k-means, hierarchical clustering, and DBSCAN. Common classification algorithms include decision trees, logistic regression, and support vector machines.
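A minimal side-by-side sketch of the two settings on the same data, assuming scikit-learn: k-means ignores the labels, while the decision tree requires them.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # unsupervised: no y
classifier = DecisionTreeClassifier().fit(X, y)            # supervised: needs y
print(clusters[:5], classifier.predict(X[:5]))
```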

5. What are the advantages and disadvantages of using clustering and classification in machine learning?

The advantages of using clustering and classification in machine learning include the ability to handle large datasets, identify patterns and relationships in the data, and make predictions or decisions based on the data. However, the main disadvantage is that the accuracy of the results depends heavily on the quality of the data and the chosen algorithm.
