Linear Regression, etc.: Standard vs ML Techniques


Discussion Overview

The discussion centers around the differences between traditional statistical techniques such as Linear and Logistic Regression and machine learning (ML) methods, particularly in the context of their implementation in standard software packages like Excel and SPSS. Participants explore the implications of using ML algorithms versus traditional methods, including considerations of validation, regularization, and the practicality of coding versus using built-in software features.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Conceptual clarification

Main Points Raised

  • Some participants question the necessity of using ML algorithms when traditional software can achieve similar results without coding, suggesting that the optimization process in ML might not be essential for all applications.
  • Others argue that both ML and classical statistics utilize off-the-shelf algorithms, and there are various methods to achieve similar outcomes, emphasizing that starting from scratch is often unnecessary.
  • Concerns are raised about the limitations of Excel, including its clunkiness, closed-source nature, and issues with data replication and handling larger datasets, with some participants recommending alternative programming languages like Julia.
  • There is a discussion about the importance of regularization in model training, with some participants noting that validation is a key concept in ML, while questioning its necessity in all contexts.
  • Some participants mention the potential for overfitting in powerful models and discuss various techniques to mitigate this issue, including the use of regularization parameters and separate validation sets.
  • A participant shares a personal experience of using Excel for a complex simulation, highlighting the challenges faced and the realization of needing more appropriate tools for certain tasks.

Areas of Agreement / Disagreement

Participants express differing views on the necessity and effectiveness of ML algorithms compared to traditional statistical methods. There is no clear consensus on the superiority of one approach over the other, and the discussion remains unresolved regarding the best practices for validation and model training.

Contextual Notes

Participants note that the discussion may depend on specific definitions and contexts, particularly regarding the use of validation in ML versus traditional statistics. There are also references to the limitations of certain software tools and the varying levels of comfort with programming languages among participants.

WWGD
Hi All,
This is probably trivial. What is the difference between techniques such as linear/logistic regression as done in ML and the same techniques done in standard software packages: Excel (incl. add-ons), SPSS, etc.? Why use ML algorithms when the same can be accomplished with standard software without the need to write code? EDIT: I assume the ML part comes down to optimizing k over different partitions of the data set into training/test data, computing in each case the FPs and FNs, and for each choice of k computing the error or ROC curves?
 
Can you expand on this a bit? In ML you often use off the shelf algorithms. In classical statistics you often use off the shelf algorithms. If you're talking about the result from a basic linear model (presumably minimizing an L1 or L2 norm, possibly with an L1 or L2 norm regularization parameter thrown in), there are lots of different ways to achieve this. In either school of thought, there's no need to start from scratch.

- - - -
Get away from Excel though. Unless a customer demands it, you don't want to use it. It's clunky, closed-source, and has major replication problems (i.e. it makes it hard for others to recreate / audit your work). Excel also balks at medium-sized amounts of data, and I don't think you get anything like the BLAS/LAPACK/MKL speedups you can get in array operations in certain programming languages.

As a simple example: you can do linear programming inside excel, but I wouldn't really recommend it. Much more flexible approaches are available (I reach for Julia here). This naturally generalizes to doing linear regression with an L1 norm.
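To make the "lots of different ways to achieve this" point concrete, here is a minimal numpy sketch (the data is made up) of the basic L2 case: minimizing the sum of squared residuals gives the same fit whether you get it from Excel's LINEST, SPSS, or an ML library.

```python
import numpy as np

# Made-up data: y is roughly 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = np.arange(10.0)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=10)

# Design matrix with an intercept column.
X = np.column_stack([x, np.ones_like(x)])

# Minimize ||Xw - y||_2 -- the same "off the shelf" solution every tool returns.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
slope, intercept = w
```

Swapping the L2 norm for an L1 norm (least absolute deviations) is where the LP formulation mentioned above comes in; the fit is no longer a one-liner, but the principle is the same.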
 
StoneTemplePython said:
Can you expand on this a bit? In ML you often use off the shelf algorithms. In classical statistics you often use off the shelf algorithms. If you're talking about the result from a basic linear model (presumably minimizing an L1 or L2 norm, possibly with an L1 or L2 norm regularization parameter thrown in), there are lots of different ways to achieve this. In either school of thought, there's no need to start from scratch.

- - - -
Get away from Excel though. Unless a customer demands it, you don't want to use it. It's clunky, closed-source, and has major replication problems (i.e. it makes it hard for others to recreate / audit your work). Excel also balks at medium-sized amounts of data, and I don't think you get anything like the BLAS/LAPACK/MKL speedups you can get in array operations in certain programming languages.

As a simple example: you can do linear programming inside excel, but I wouldn't really recommend it. Much more flexible approaches are available (I reach for Julia here). This naturally generalizes to doing linear regression with an L1 norm.
EDIT: Please see my recent rewrite of the OP. Thanks, it is mostly a matter of (my) comfort. I know Excel, but (_really sorry_ ;)) I know Python only at an intro level.
 
WWGD said:
EDIT: Please see my recent rewrite of the OP. Thanks, it is mostly a matter of (my) comfort. I know Excel, but (_really sorry_ ;)) I know Python only at an intro level.

I think you're mostly talking about regularization stages (i.e. tuning training data against some that is partitioned off in the validation, not testing, set). Again, this could all be done under the hood in a built-in / off-the-shelf program or algorithm. If you have access to such an algorithm, and aren't pursuing this for pedagogical / learning purposes, then you should be good to use the built-in stuff. It's really a judgment call. Sometimes people recreate stuff from scratch for no good reason.
- - - -
The more I think about it, though, your question could be more general: why bother with validation at all? I suppose in this context validation is an ML concept. There is a very nice treatment of this in the book Learning from Data, which is a great purchase for $30 if you're in the US (NY?). This is all discussed in chapter 4, "Overfitting", for a reason. In general we worry about overfitting, and so a lot of different techniques have been developed to minimize it. Regularization parameters (e.g. Tikhonov) are one way to deal with overfitting in linear models. Having a separate validation set is another way of dealing with overfitting. You're basically leaving out some of the training data and getting a 'sneak peek' at the benefits of using the test data, except you're not using the test data, so it doesn't get contaminated. There are other techniques, and ways to combine techniques. But the idea is that when you have a highly expressive / powerful model, you need to be deeply concerned about overfitting.
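The two ideas above can be combined in a few lines. A minimal numpy sketch (data, dimensions, and the lambda grid are all made up for illustration): Tikhonov (ridge) regularization fitted on training data, with a partitioned-off validation set used to pick the regularization strength.

```python
import numpy as np

# Made-up regression problem.
rng = np.random.default_rng(1)
n, d = 60, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(0, 0.5, size=n)

# Partition off a validation set (not the test set!).
X_tr, y_tr = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

def ridge_fit(X, y, lam):
    # Closed-form Tikhonov solution: (X^T X + lam*I)^{-1} X^T y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Pick the lambda with the lowest validation error.
lams = [0.0, 0.1, 1.0, 10.0, 100.0]
val_err = {lam: np.mean((X_val @ ridge_fit(X_tr, y_tr, lam) - y_val) ** 2)
           for lam in lams}
best_lam = min(val_err, key=val_err.get)
```

This is exactly the "under the hood" work a built-in routine would do for you; doing it by hand once is mainly useful pedagogically.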
- - - -
When you have the time and interest: I really like the materials from MIT. 6.00.1x and 6.00.2x on edX are terrific intros to Python and CS, and they're free:

https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x-11
https://www.edx.org/course/introduction-computational-thinking-data-mitx-6-00-2x-6

FWIW, I actually built a Texas hold'em poker MC simulator from scratch, entirely in Excel, a few years ago. (It was to help settle disputes with a certain person over the odds of getting some more complicated hands -- the direct combinatorics approach started to get tedious and error-prone.) The simulator worked but the process was brutal, and I realized I didn't have the right tool for the job. This was one of the final straws that pushed me toward proper coding.
 
StoneTemplePython said:
I think you're mostly talking about regularization stages (i.e. tuning training data against some that is partitioned off in the validation, not testing, set). Again, this could all be done under the hood in a built-in / off-the-shelf program or algorithm. If you have access to such an algorithm, and aren't pursuing this for pedagogical / learning purposes, then you should be good to use the built-in stuff. It's really a judgment call. Sometimes people recreate stuff from scratch for no good reason.
- - - -
The more I think about it, though, your question could be more general: why bother with validation at all? I suppose in this context validation is an ML concept. There is a very nice treatment of this in the book Learning from Data, which is a great purchase for $30 if you're in the US (NY?). This is all discussed in chapter 4, "Overfitting", for a reason. In general we worry about overfitting, and so a lot of different techniques have been developed to minimize it. Regularization parameters (e.g. Tikhonov) are one way to deal with overfitting in linear models. Having a separate validation set is another way of dealing with overfitting. You're basically leaving out some of the training data and getting a 'sneak peek' at the benefits of using the test data, except you're not using the test data, so it doesn't get contaminated. There are other techniques, and ways to combine techniques. But the idea is that when you have a highly expressive / powerful model, you need to be deeply concerned about overfitting.
- - - -
When you have the time and interest: I really like the materials from MIT. 6.00.1x and 6.00.2x on edX are terrific intros to Python and CS, and they're free:

https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x-11
https://www.edx.org/course/introduction-computational-thinking-data-mitx-6-00-2x-6

FWIW, I actually built a Texas hold'em poker MC simulator from scratch, entirely in Excel, a few years ago. (It was to help settle disputes with a certain person over the odds of getting some more complicated hands -- the direct combinatorics approach started to get tedious and error-prone.) The simulator worked but the process was brutal, and I realized I didn't have the right tool for the job. This was one of the final straws that pushed me toward proper coding.

Are you including XL-Miner (the Excel data-mining plug-in) in your assessment? Thanks, that is what I was wondering: whether the difference between what we can do with Excel and with ML is precisely CV, using it to find the optimal choice of k to minimize error (or, equivalently, maximize accuracy (TP+TN)/Total, or the true-positive rate TP/Total). But how do we find the optimal k from the ROC curve? Do we maximize analytically? Also, do you know of some ML technique to be used with categorical or non-numerical data? I am thinking of latent constructs. EDIT: I am thinking of a set of Sudokus: some I can do, some I cannot. I cannot see the difference between the two, so I could try some clustering. But I can't think of how to quantify the differences, nor how to find numerical coordinates for a choice of Sudoku: these puzzles have the same number of empty squares and no noticeable difference. Sorry if I am being unclear; please feel free to ask me for clarifications.
 
WWGD said:
Are you including XL-Miner (the Excel data-mining plug-in) in your assessment? Thanks, that is what I was wondering: whether the difference between what we can do with Excel and with ML is precisely CV, using it to find the optimal choice of k to minimize error (or, equivalently, maximize accuracy (TP+TN)/Total, or the true-positive rate TP/Total). But how do we find the optimal k from the ROC curve? Do we maximize analytically?

Haha I hadn't heard of XL-Miner, but the fact that it exists concerns me a bit.

I used to use this free basic analytics suite called Poptools, built by a guy doing pest control in Australia. http://www.poptools.org/about/

I haven't used it in a few years but still have some fond memories of it.

Learning From Data-- page 149 said:
If we could analytically obtain ##E_{cv}##, that would be a big bonus, but analytic results are often difficult to come by for cross validation. One exception is in the case of linear models, where we are able to derive an exact analytic formula for the cross validation estimate

So yes, you may be able to pursue an analytic result here. (In general you just hope for analytic bounds -- in deep learning even this is elusive -- and use heuristics from there.) I'm not sure what your k refers to -- it could refer to the validation set or K-fold cross validation (they call it V-fold in the book, though). I don't think the ROC curve even comes up in that chapter. But you should be able to look up analytic results for linear models and, I trust, follow the derivations.
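The exact analytic formula the quote alludes to can be sketched for ordinary least squares (toy data below, not from the book): with hat matrix ##H = X(X^TX)^{-1}X^T## and residuals ##e = y - Hy##, the leave-one-out CV estimate is ##E_{cv} = \frac{1}{n}\sum_i (e_i/(1-H_{ii}))^2## -- no refitting loop needed. The brute-force loop confirms it.

```python
import numpy as np

# Toy linear model: y = 1 + 2x + noise.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(15), rng.normal(size=15)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.3, size=15)

# Analytic leave-one-out CV via the hat matrix.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
e_cv_analytic = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: refit with each point left out, predict it, average the errors.
errs = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    w = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ w) ** 2)
e_cv_brute = np.mean(errs)
```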
 
StoneTemplePython said:
Haha I hadn't heard of XL-Miner, but the fact that it exists concerns me a bit.

I used to use this free basic analytics suite called Poptools, built by a guy doing pest control in Australia. http://www.poptools.org/about/

I haven't used it in a few years but still have some fond memories of it.
So yes, you may be able to pursue an analytic result here. (In general you just hope for analytic bounds -- in deep learning even this is elusive -- and use heuristics from there.) I'm not sure what your k refers to -- it could refer to the validation set or K-fold cross validation (they call it V-fold in the book, though). I don't think the ROC curve even comes up in that chapter. But you should be able to look up analytic results for linear models and, I trust, follow the derivations.

Sorry, I was thinking of k-nearest neighbors. As you may be able to tell, I am pretty new to this whole ML thing.
 
WWGD said:
EDIT: I am thinking of a set of Sudokus: some I can do, some I cannot. I cannot see the difference between the two, so I could try some clustering. But I can't think of how to quantify the differences, nor how to find numerical coordinates for a choice of Sudoku: these puzzles have the same number of empty squares and no noticeable difference. Sorry if I am being unclear; please feel free to ask me for clarifications.

Ah, I responded before the EDIT showed up. I was actually thinking about Sudoku in my linear programming example -- albeit not in the way you mean it for classifiers. (The LP / integer programming approach uses an indicator-variable expansion so that you have ##n^3## terms -- it's possible to flatten this and use an LP/MIP solver in Excel, but it's tedious. Much, much easier to do directly using Julia with arrays that have 3 dimensions.)

Do you perhaps have links to CSVs of, say, two of these different types of puzzles? I'm not sure I'll have much by way of ideas, but it still could be fun to play around with.

----
edit:
The typical approach is to start by doing something much simpler, like predicting (in a Bayesian sense) survivors from the Titanic. This is an intro data set on Kaggle.
 
StoneTemplePython said:
Ah, I responded before the EDIT showed up. I was actually thinking about Sudoku in my linear programming example -- albeit not in the way you mean it for classifiers. (The LP / integer programming approach uses an indicator-variable expansion so that you have ##n^3## terms -- it's possible to flatten this and use an LP/MIP solver in Excel, but it's tedious. Much, much easier to do directly using Julia with arrays that have 3 dimensions.)

Do you perhaps have links to CSVs of, say, two of these different types of puzzles? I'm not sure I'll have much by way of ideas, but it still could be fun to play around with.
OK, I have some screenshots; let me see if I can turn them into CSVs. It seems like some clustering / unsupervised learning, trying 2-3 clusters to start with. But I can't think of a metric if we used K-means.
 
  • #10
WWGD said:
OK, I have some screenshots; let me see if I can turn them into CSVs. It seems like some clustering / unsupervised learning, trying 2-3 clusters to start with. But I can't think of a metric if we used K-means.
Hah. You missed my edit too.

Do you mean K-means (i.e. clustering / unsupervised learning) or K-nearest neighbors (which is a similarity method for supervised learning)?
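The distinction can be shown on toy 1-D data (points and labels below are made up; numpy assumed): K-means gets no labels and finds cluster centers, while K-nearest-neighbors uses labeled points to classify a new one.

```python
import numpy as np

# Six made-up 1-D points forming two obvious groups near 1 and 9.
pts = np.array([1.0, 1.2, 0.8, 9.0, 9.3, 8.7])

# --- K-means (unsupervised), k=2, Lloyd's iterations ---
centers = np.array([0.0, 5.0])          # arbitrary starting guesses
for _ in range(10):
    # Assign each point to its nearest center, then move centers to the means.
    assign = np.abs(pts[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([pts[assign == j].mean() for j in range(2)])

# --- K-nearest neighbors (supervised), k=3 ---
labels = np.array([0, 0, 0, 1, 1, 1])   # labels are given, not discovered
def knn_predict(x, k=3):
    nearest = np.argsort(np.abs(pts - x))[:k]
    return np.bincount(labels[nearest]).argmax()
```

Both do need a notion of distance, which is the metric question raised above; here it is just absolute difference on the line.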
 
  • #11
StoneTemplePython said:
Hah. You missed my edit too.

Do you mean K-means (i.e. clustering / unsupervised learning) or K-nearest neighbors (which is a similarity method for supervised learning)?
I was thinking of k-means, sorry, but it seems both would require a metric of some sort, right?
 
  • #12
What if we used the little paper on matrices and Sudokus I posted a few weeks back? I will give it a try.
 
  • #13
WWGD said:
I was thinking of k-means, sorry, but it seems both would require a metric of some sort, right?

The language varies, but I think just about everything requires a cost function of some kind that you're trying to minimize.

Unsupervised learning is interesting, but a lot more nebulous than supervised. (E.g. it's even challenging to state what it is we're trying to accomplish in unsupervised learning.)

- - - -
That second edX course, 6.00.2x, actually has a really cool, really simple project at the end involving hierarchical clustering. Basically, 'group' the 48 mainland states in the US into, say, 6 groups, based on some historical weather data (I think just basics like average low and high temp). The clusters mapped almost exactly to what I'd call west coast vs east coast vs south vs southwest vs midwest vs rockies, or whatever.

My general motto is 'start simple and build', which may suggest not doing unsupervised learning on sudoku puzzles but instead on 'easier' things like that weather problem. That said, I enjoy messing around with sudoku puzzles.

I don't know how big a sudoku habit you have. But if you and a friend had records of 100+ puzzles you attempted and solved or didn't solve (or perhaps grabbed a lot of 'easy ones' and 'hard ones' as labelled by some online sudoku enthusiast community), you could turn this into a supervised learning problem -- i.e. you'd have labels ('hard' or 'easy'). If you're going the sudoku route, I guess I'd suggest doing something simpler like supervised learning.
 
  • Like
Likes   Reactions: WWGD
  • #14
What do you think of the idea of recursion? Once we solve a single square, it seems the properties of the Sudoku change, though I'm not clear on how.
 
  • #15
WWGD said:
What do you think of the idea of recursion? Once we solve a single square, it seems the properties of the Sudoku change, though I'm not clear on how.

What does it mean to solve a single square, though? The classic recursive setups that work really well (basically divide and conquer and dynamic programming) aren't all that helpful -- if they were, it wouldn't be an NP-complete problem.

It may be that you mean the 'easy' problems are sufficiently (and 'obviously') constrained that we can be certain this square has this value, which immediately implies this other one, which implies this other one, and so on.

In some respects, starting with a single square and seeing how the game changes does fit my motto of 'start simple and build', so maybe it's a good way to go?
 
  • #16
StoneTemplePython said:
What does it mean to solve a single square, though? The classic recursive setups that work really well (basically divide and conquer and dynamic programming) aren't all that helpful -- if they were, it wouldn't be an NP-complete problem.

It may be that you mean the 'easy' problems are sufficiently (and 'obviously') constrained that we can be certain this square has this value, which immediately implies this other one, which implies this other one, and so on.

In some respects, starting with a single square and seeing how the game changes does fit my motto of 'start simple and build', so maybe it's a good way to go?
No, I mean the complexity decreases once you know the value of any one square. Maybe a Sudoku with n-1 unfilled spots is qualitatively different from one with n unfilled spots. Also, I have thought of finding inference rules that allow one to conclude, under certain setups, that a certain number belongs in a given square. The most trivial: if 8 squares are filled in a 3x3 box, then the 9th square is whichever element of {1,2,...,9} does not appear there. A more powerful rule would speed up solutions, though I have no idea how to go about finding one.
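That most trivial rule is easy to state as code. A stdlib-only sketch (the helper name and the 0-for-empty convention are mine):

```python
def fill_last_in_box(box):
    """box: list of 9 ints for one 3x3 box, 0 = empty.
    If exactly one square is empty, fill it with the missing digit."""
    empties = [i for i, v in enumerate(box) if v == 0]
    if len(empties) == 1:
        # The missing digit is whichever of 1..9 is absent from the box.
        box[empties[0]] = (set(range(1, 10)) - set(box)).pop()
    return box
```

More powerful rules (the 'hidden single', pair-elimination, etc. that solver writers use) follow the same pattern: a predicate on the current grid that forces a value, applied repeatedly until no rule fires.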
 
  • #17
Sorry for the rambling, Python; this seems not to be going anywhere useful. Maybe I should drop it until I have an actually productive idea; I don't think these threads should be used for brainstorming. Feel free to ignore my posts.
 
  • #18
Sorry to revive this one. Basically, I am curious about the actual "learning" in ML. Say in (untrained) classification. Do ML algorithms loop/iterate and check what happens for different partitions of the data into test and training data (over, say, all 70-30 partitions), produce ROC curves for different values of k, and find the best fit (in terms, when doing binary classifiers, of the ratio TPR/FPR)?
 
  • #19
If your question is "What Would R do?", then I guess the answer is maybe?

In general I don't pay that much attention to ROC curves. The name of the game is minimizing your expected out-of-sample error (##E_{out}##) -- potentially modified by some cost function (e.g. type 1 vs type 2 errors may have asymmetric costs to the model user).

We don't have clean access to ##E_{out}## and instead train on minimizing ##E_{in}## in such a way that it should generalize to ##E_{out}##.

It's kind of hard to say more than that. I don't know what untrained classification is. The whole point is to train and make quality predictions. Maybe you mean unsupervised classification, but that isn't a term I've heard before -- maybe you mean clustering. Again, I'd start with supervised learning, though -- it's a lot simpler and easier to say what you're doing and what the goals are.
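The ##E_{in}## / ##E_{out}## gap is easy to see on a toy example (made-up data; numpy assumed): fit the same small noisy sample with a low-degree and a high-degree polynomial. The expressive model drives in-sample error toward zero while out-of-sample error gets worse, which is exactly the overfitting the validation machinery guards against.

```python
import numpy as np

rng = np.random.default_rng(3)

def make(n):
    # Made-up target: sin(3x) plus noise, sampled on [-1, 1].
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + rng.normal(0, 0.2, size=n)

x_tr, y_tr = make(12)      # small training sample
x_out, y_out = make(200)   # stand-in for out-of-sample data

def errors(deg):
    coef = np.polyfit(x_tr, y_tr, deg)
    e_in = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(coef, x_out) - y_out) ** 2)
    return e_in, e_out

e_in_3, e_out_3 = errors(3)     # modest model
e_in_11, e_out_11 = errors(11)  # interpolates all 12 training points
```

The degree-11 fit wins on ##E_{in}## and loses badly on ##E_{out}##; training error alone is a misleading scoreboard.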
 
  • #20
StoneTemplePython said:
If your question is "What Would R do?", then I guess the answer is maybe?

In general I don't pay that much attention to ROC curves. The name of the game is minimizing your expected out-of-sample error (##E_{out}##) -- potentially modified by some cost function (e.g. type 1 vs type 2 errors may have asymmetric costs to the model user).

We don't have clean access to ##E_{out}## and instead train on minimizing ##E_{in}## in such a way that it should generalize to ##E_{out}##.

It's kind of hard to say more than that. I don't know what untrained classification is. The whole point is to train and make quality predictions. Maybe you mean unsupervised classification, but that isn't a term I've heard before -- maybe you mean clustering. Again, I'd start with supervised learning, though -- it's a lot simpler and easier to say what you're doing and what the goals are.
Yes, sorry, I meant unsupervised. I am just being thick and not getting what the learning part is in these algorithms, other than for unsupervised classification. How does, e.g., the regression I do in SPSS differ from an ML algorithm's linear regression?
 
  • #21
WWGD said:
Yes, sorry, I meant unsupervised. I am just being thick and not getting what the learning part is in these algorithms, other than for unsupervised classification. How does, e.g., the regression I do in SPSS differ from an ML algorithm's linear regression?

For vanilla linear regression (i.e. a kind of supervised learning), ML and classical stats are often the same thing with different jargon. I don't think classical stats does much cross validation? But you have other regularization techniques and parameters (I mentioned Tikhonov in an earlier post) that are used in both classical stats and machine learning.
 
  • #22
Re regression in ML: I don't see any place for learning to take place, but then again, I have been at this for just a few months. We input two quantitative variables X, Y and get as output the line of (least-squares) best fit. I don't see any aspect to optimize other than minimizing the sum of squares of residuals. Why/how does this require an ML algorithm? EDIT: Never mind, sorry, just read your last. Please ignore.
 
  • #23
If all you want to do is minimize sum of squares, then you just do that. The people behind ML would say doing so is an ML algorithm.
 
  • #24
StoneTemplePython said:
If all you want to do is minimize sum of squares, then you just do that. The people behind ML would say doing so is an ML algorithm.
And what makes an algorithm into an ML algorithm, at least informally?
 
  • #25
WWGD said:
And what makes an algorithm into an ML algorithm, at least informally?

This is getting close to when someone asks what a vector is and is told it's an object in a vector space...

A machine learning algorithm is an algorithm that helps you do machine learning. Seriously. Machine learning can have lots of definitions, but basically we're talking about learning from data to make predictions (out of sample). There is a material intersection with statistics.
 
  • #26
Ok, thanks for your patience; I am still a newbie in this area.
 
  • #27
I'd suggest working through a thoughtful book on the matter.

One of the challenges is people are tribal (should I say they cluster?) and develop their own jargon and ways of acting. So you kind of have disjoint groups of people (statisticians and machine learning folks) acting as if their tasks and techniques are disjoint. In many cases they are quite similar (maybe even the same).
- - - -
You may enjoy Statistical Learning

https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about

A bit too much classical stats and too much emphasis on R for me. But I know a lot of people like this. Both profs are stats professors at Stanford.

(There are also two free pdf books available by the authors.)
 
  • #28
WWGD said:
Hi All,
This is probably trivial. What is the difference between techniques such as linear/logistic regression as done in ML and the same techniques done in standard software packages: Excel (incl. add-ons), SPSS, etc.? Why use ML algorithms when the same can be accomplished with standard software without the need to write code? EDIT: I assume the ML part comes down to optimizing k over different partitions of the data set into training/test data, computing in each case the FPs and FNs, and for each choice of k computing the error or ROC curves?
You should consider explaining your abbreviations when using them for the first time. Only after reading through half of the answers did it become clear to me that most probably you were not talking about maximum likelihood but machine learning.
 
  • #29
DrDu said:
You should consider explaining your abbreviations when using them for the first time. Only after reading through half of the answers did it become clear to me that most probably you were not talking about maximum likelihood but machine learning.
Good point, sorry. Unfortunately, it is too late to edit my original. EDIT: I may get my SAA membership revoked. SAA: Society Against Abbreviations... ;)
 
