Can't we use linear regression for classification/prediction?


Discussion Overview

The discussion centers on whether linear regression can be used for classification tasks, in particular whether it can predict categorical outcomes such as yes/no decisions. Participants explore the differences between linear regression and logistic regression, questioning the limitations and implications of using linear regression for this purpose.

Discussion Character

  • Debate/contested
  • Technical explanation

Main Points Raised

  • Some participants argue that linear regression can be used to predict binary outcomes by establishing a threshold (e.g., if x > 0.5, then y = 1; otherwise, y = 0).
  • Others point out that linear regression is traditionally used for continuous values, while logistic regression is designed for categorical outcomes, suggesting a fundamental difference in application.
  • A participant notes that linear regression can be used with nonlinear functions of the inputs, provided the model remains linear in its parameters.
  • Concerns are raised about the validity of using linear regression for classification, particularly regarding the nature of the relationship between variables and the definition of categories.
  • One participant critiques a visual example provided in the discussion, stating that it does not effectively illustrate a threshold curve due to the nature of the input data points.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of using linear regression for classification. There is no consensus on whether it is a valid approach, with some supporting its use under certain conditions and others arguing against it based on the nature of linear relationships.

Contextual Notes

Participants highlight limitations related to the definition of categories and the nature of the relationships between variables when linear regression is used for classification. One reply about the performance implications of branching in programming was posted in error and is tangential to the topic.

shivajikobardan
Homework Statement
Difference between logistic and linear regression.
Relevant Equations
none
They say that linear regression is used to predict numerical/continuous values, whereas logistic regression is used to predict categorical values. But I think we can predict yes/no from linear regression as well.

[Attached image: scatter plot with a straight-line (linear regression) fit]


Just say that for x > some value, y = 0; otherwise, y = 1. What am I missing? What is the difference between this and the figure below?

[Attached image: s-shaped (logistic) threshold curve]
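The thresholding idea in the question can be sketched in a few lines. This is a minimal illustration only; the synthetic data, the seed, and the class means (2 and 6) are my own assumptions, not values from the thread.

```python
import numpy as np

# Hypothetical synthetic data (not from the thread): class 0 clustered near
# x = 2, class 1 clustered near x = 6.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(2.0, 1.0, 50), rng.normal(6.0, 1.0, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Ordinary least-squares fit of the line y ~ a + b*x to the 0/1 labels.
A = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# The thresholding idea from the question: predict 1 wherever the fitted
# line rises above 0.5.
y_hat = (a + b * x > 0.5).astype(int)
accuracy = (y_hat == y).mean()
```

On well-separated data like this, the thresholded line classifies well; the usual objection is that outliers far from the boundary pull a least-squares line (and hence the decision boundary) around, whereas logistic regression's s-curve saturates and is largely insensitive to them.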
 
Branching can have profound effects on performance: for example, the overhead of accessing code modules that are not quickly available after an if/then/else branch. It is not as simple as you might think a priori. Complicated data analysis like the kind you present can have this sort of problem too.

Here is a somewhat dated, but still very important read from Ulrich Drepper:
https://www.akkadia.org/drepper/cpumemory.pdf

I am sure someone will spout reasons why it is "bad", but your question is down in the weeds, and low-level db programmers are down there in the weeds with you and have to mess with branching effects all the time. I concede that very high-level programming platforms can negate some of this. But the question asked above still remains for us weedy types.

Posted in error.
 
jim mcnamara said:
Branching can have profound effects on performance...
Is this the reply to a different thread?
 
shivajikobardan said:
What is its difference with this below figure?
Are you joking? One is a straight line (hence linear regression), the other is an s-curve.

The second plot is a terrible example of a threshold curve BTW - the input data points should not all be 0 or 1 because in that case there is no need to apply the threshold.
 
@pbuk thanks for the correction.
 
pbuk said:
Are you joking? One is a straight line (hence linear regression), the other is an s-curve.
Ofc I get that. What I am trying to say is why can't we say if x>0.5, y=1 else y=0?
 
shivajikobardan said:
Ofc I get that. What I am trying to say is why can't we say if x>0.5, y=1 else y=0?
You can use linear regression with nonlinear functions as long as the model remains linear in its parameters. The values of the nonlinear functions become the independent variables of the linear regression. That is, ##Y = a_0 + a_1 f_1(X_1) + a_2 f_2(X_2) + \epsilon## is a model where linear regression can be used to find the ##a_i##s even if the ##f_i##s are nonlinear. The values ##z_{i,j}=f_i(x_j)## are the new independent variables.
One limitation on the use of linear regression for classification is that the classifications often cannot be defined by a variable, ##X##. How can the categories (man, woman, dog, cat, duck) be defined by a real variable, ##X##?
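The point about fitting nonlinear functions by linear regression can be sketched numerically. The choice of basis functions (sin and exp), the true coefficients, and the noise level below are arbitrary assumptions for illustration.

```python
import numpy as np

# Generate data from Y = a0 + a1*sin(X) + a2*exp(X) + noise: nonlinear in X,
# but linear in the coefficients a_i.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, 200)
true_a = np.array([1.0, 2.0, -0.5])
y = true_a[0] + true_a[1] * np.sin(x) + true_a[2] * np.exp(x) \
    + rng.normal(0.0, 0.05, 200)

# The transformed values z_i = f_i(x) become the regressors of an ordinary
# least-squares problem, solved directly.
Z = np.column_stack([np.ones_like(x), np.sin(x), np.exp(x)])
a_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

Despite the nonlinear basis functions, the fit is an ordinary linear regression in the ##a_i##s, and the estimated coefficients land close to the true ones.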
 
shivajikobardan said:
What I am trying to say is why can't we say if x>0.5, y=1 else y=0?
Because that is not a linear relationship.
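One way to see this (an illustrative sketch, not from the thread): the rule "y = 1 if x > 0.5 else y = 0" is a step function, and a linear relationship y = a + b*x must have the same slope everywhere, which a step does not.

```python
import numpy as np

# The proposed rule is a step function of x.
x = np.array([0.0, 0.4, 0.6, 1.0])
y = np.where(x > 0.5, 1.0, 0.0)  # [0, 0, 1, 1]

# A linear relationship has one slope everywhere; compute the slope on each
# side of the threshold and see that they differ.
slope_left = (y[1] - y[0]) / (x[1] - x[0])   # flat segment: slope 0
slope_right = (y[2] - y[1]) / (x[2] - x[1])  # the jump: slope 5
```

Since the two slopes differ, no single line a + b*x reproduces the rule; logistic regression instead models the probability of y = 1 with an s-curve and applies the threshold to that.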
 
