How to interpret the cost function?

In summary, the conversation discusses how to interpret the cost function in machine learning and how the parameters theta zero and theta one determine the intercept and slope of a straight line on a graph. It is explained that theta one has no units and represents the increase in the vertical coordinate for every unit increase in the horizontal coordinate. The conversation also includes an exercise in plotting different lines on a graph to better understand the concept.
  • #1
Richard_Steele
I am just starting a course about machine learning and I don't know how to interpret the cost function.
2j2i5aq.jpg


When the teacher draws the straight line in the x and y coordinates, it looks like:
1491kcg.jpg


I see that theta zero is where the straight line starts (on the left side), on the Y axis.
My question is about what theta one modifies. Does theta one change the inclination?
 
  • #2
Richard_Steele said:
My question is about what theta one modifies. Does theta one change the inclination?
Yes, also called the slope.

However, how this is to be interpreted in a machine learning context is not clear to me, as I never really understood what that field is about. EDIT: Maybe somebody such as @StatGuy2000 can help you with the interpretation?
 
Last edited:
  • Like
Likes Richard_Steele
  • #3
Krylov said:
Yes, also called the slope.

However, how this is to be interpreted in a machine learning context is not clear to me, as I never really understood what that field is about.
OK, so Theta_1 modifies the slope.
But I don't know how to interpret the scale of the slope. In the examples (the middle and right graphs I posted in post #1), Theta_1 = 0.5.
What kind of unit is 0.5? Degrees? I don't know how to relate 0.5 to the slope of the straight line.

Responding to your question about machine learning: this is used in linear regression. You give a dataset to the algorithm (for example, x = house size and y = house price), and it has to calculate the straight line that best fits the dataset. Then you give the algorithm a value of X (the size of a house) and the software calculates the value of Y (the price of the house). This is supervised learning. Supervised means that you give the correct answers to the software so it can learn from the data. Unsupervised learning means that you give data to the software, but it is unlabelled (no correct answers are included in the data).
 
  • #4
I moved the thread to our homework section, as it is homework-like.

Richard_Steele said:
What kind of unit is 0.5? Degrees?
It does not have units. It means if x (the horizontal coordinate) increases by 1, then h (the vertical coordinate) increases by 0.5. You can see that if you check the function definition and increase x by 1.
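A quick numerical check (a minimal Python sketch, not part of the course material) makes the same point: the change in h per unit change in x is exactly ##\theta_1##.

# Hypothesis h(x) = theta0 + theta1 * x, with the values from the example
theta0 = 1.0
theta1 = 0.5

def h(x):
    return theta0 + theta1 * x

print(h(3.0))           # 2.5
print(h(4.0))           # 3.0
print(h(4.0) - h(3.0))  # 0.5, i.e. theta1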
 
  • Like
Likes EnumaElish and Richard_Steele
  • #5
Richard_Steele said:
I am just starting a course about machine learning and I don't know how to interpret the cost function.
2j2i5aq.jpg


When the teacher draws the straight line in the x and y coordinates, it looks like:
1491kcg.jpg


I see that theta zero is where the straight line starts (on the left side), on the Y axis.
My question is about what theta one modifies. Does theta one change the inclination?

Just look at the three diagrams. What do you see when you compare the first two? To anchor your understanding, try the following little exercise for yourself: plot the line y=1+x, which uses ##\theta_0=1, \theta_1=1##. Compare that new line with the third line plotted above. What do you see?
 
  • Like
Likes Richard_Steele
  • #6
Ray Vickson said:
Just look at the three diagrams. What do you see when you compare the first two? To anchor your understanding, try the following little exercise for yourself: plot the line y=1+x, which uses ##\theta_0=1, \theta_1=1##. Compare that new line with the third line plotted above. What do you see?
I see that the point where the straight line crosses the Y axis is 1 (Y = 1). When X increases by 1 unit, then Y = X + 1, so Y is always one unit higher than X.
Right?
 
  • #7
Richard_Steele said:
I see that the point where the straight line crosses the Y axis is 1 (Y = 1). When X increases by 1 unit, then Y = X + 1, so Y is always one unit higher than X.
Right?

Yes, but I asked you to compare the new line with the third line given in post #1. Take a sheet of graph paper; plot both lines on the same sheet. Now tell me what you see.
 
  • #8
Ray Vickson said:
Just look at the three diagrams. What do you see when you compare the first two? To anchor your understanding, try the following little exercise for yourself: plot the line y=1+x, which uses ##\theta_0=1, \theta_1=1##. Compare that new line with the third line plotted above. What do you see?
2zyy4aw.jpg
 
  • #9
Ray Vickson said:
Yes, but I asked you to compare the new line with the third line given in post #1. Take a sheet of graph paper; plot both lines on the same sheet. Now tell me what you see.
OK, in a few minutes I will post the comparison here.
 
  • #10
Ray Vickson said:
Yes, but I asked you to compare the new line with the third line given in post #1. Take a sheet of graph paper; plot both lines on the same sheet. Now tell me what you see.

I see a variation in the slope. The line 1 + 1x has a steeper slope than 1 + 0.5x.
14wwplf.jpg
 
  • #11
Richard_Steele said:
I see a variation in the slope. The line 1 + 1x has a steeper slope than 1 + 0.5x.
14wwplf.jpg

Exactly. When x increases by 1, 1+x increases by 1 but 1 + .5*x increases by 1/2.

It would have been more revealing if you had (as I suggested) plotted both lines on the same sheet of paper. If the software you are using does not allow that, then do it by hand on an actual, physical sheet of paper. Alternatively, you can use the graphing packages in a typical spreadsheet to plot several lines or curves on the same plot.
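If you have Python available, a minimal matplotlib sketch (just one tool among many; a spreadsheet works equally well) puts both lines on the same axes:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 4, 100)
plt.plot(x, 1 + 1.0 * x, label="y = 1 + x")      # theta1 = 1
plt.plot(x, 1 + 0.5 * x, label="y = 1 + 0.5x")   # theta1 = 0.5
plt.axhline(0, color="gray", linewidth=0.5)      # draw the axes for reference
plt.axvline(0, color="gray", linewidth=0.5)
plt.legend()
plt.show()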
 
  • Like
Likes Richard_Steele
  • #12
Ray Vickson said:
Exactly. When x increases by 1, 1+x increases by 1 but 1 + .5*x increases by 1/2.

It would have been more revealing if you had (as I suggested) plotted both lines on the same sheet of paper. If the software you are using does not allow that, then do it by hand on an actual, physical sheet of paper. Alternatively, you can use the graphing packages in a typical spreadsheet to plot several lines or curves on the same plot.
Graphs plotted on a sheet of paper.
1zezr5.jpg
 
  • #13
Richard_Steele said:
Graphs plotted on a sheet of paper.
1zezr5.jpg

Good; that really does show up the difference most dramatically.
 
  • Like
Likes Richard_Steele
  • #14
Ray Vickson said:
Good; that really does show up the difference most dramatically.
Yes, it's clearer when I draw both on the same graph.

I am reading about 'minimizing the cost function'. What does minimizing mean, and why is minimization used?
 
  • #15
Richard_Steele said:
Yes, it's clearer when I draw both on the same graph.
Your graphs ought to show the equations; that is, y = 1 + x and y = 1 + 0.5x.

Also, the axes are usually labelled on the positive ends. You have your labels for the x-axis on the negative end.
Richard_Steele said:
I am reading about 'minimizing the cost function'. What does minimizing mean, and why is minimization used?
Any business that manufactures and sells a product is always interested in maximizing its profit. One way to do this is to minimize (make as small as possible) its costs.
 
  • Like
Likes Richard_Steele
  • #16
Mark44 said:
Your graphs ought to show the equations; that is, y = 1 + x and y = 1 + 0.5x.

Also, the axes are usually labelled on the positive ends. You have your labels for the x-axis on the negative end.

Any business that manufactures and sells a product is always interested in maximizing its profit. One way to do this is to minimize (make as small as possible) its costs.
Thanks for the advice.

I am learning about the cost function as applied to machine learning; I am using it in linear regression. So I don't know whether minimization has the same objective in manufacturing as in machine learning.
 
  • #17
Richard_Steele said:
I am learning about the cost function as applied to machine learning; I am using it in linear regression. So I don't know whether minimization has the same objective in manufacturing as in machine learning.
I don't know how it's related to machine learning, but maybe how long it takes for a program to learn something? Maybe that's what "cost" means in this situation.
 
  • #18
Mark44 said:
I don't know how it's related to machine learning, but maybe how long it takes for a program to learn something? Maybe that's what "cost" means in this situation.
In the video, the teacher shows a Cartesian plane. The horizontal axis, X, is the size of the house. The vertical axis, Y, is the price of the house. That dataset is called 'the training set' (it contains the correct answers).

Then, with the training set, the program has to calculate the straight line I was asking about in post #1 of this thread. After that, it is necessary to 'minimize the function'. It's something like calculating the best parameters theta zero and theta one, to produce a straight line that minimizes the error between the Y values (those from the training dataset) and h(x) (the hypothesis, the predicted Y value). This predicted Y value is called the prediction or the hypothesis. The only real Y values come from the real dataset (the training dataset).

The question is what minimization does and why we should apply it.
 
  • #19
mfb said:
I moved the thread to our homework section, as it is homework-like.

It does not have units. It means if x (the horizontal coordinate) increases by 1, then h (the vertical coordinate) increases by 0.5. You can see that if you check the function definition and increase x by 1.
The slope parameter is measured in Y units per X unit. In the equation Y = a + b X, b = ∂Y/∂X. If Y is "meters traveled" and X is "seconds of time", then b corresponds to a velocity measured in meters per second. (In an estimation context b would be called the average velocity or average incremental distance.) If Y is dollars and X is square feet, then b is the increase in dollars when the area increases by 1 square foot; in this case b is measured in dollars per square foot.
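For a concrete (made-up) illustration: if the fitted line were ##Y = 50000 + 120X##, with ##X## in square feet and ##Y## in dollars, then increasing ##X## from 1000 to 1001 raises the predicted price from 170,000 to 170,120 dollars, a change of ##b = 120## dollars per square foot; the intercept ##a = 50000## carries the units of ##Y## alone.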
 
Last edited:
  • Like
Likes Richard_Steele
  • #20
EnumaElish said:
In the equation Y = a + b X, b = ∂Y/∂X
Where Y is a function of one variable, X, the slope would be dY/dX. Of course, in this case, the partial derivative you wrote is the same as the ordinary derivative I wrote. However, as this thread is in the Precalc section, the OP might not be familiar with derivatives of any kind.
 
  • #21
Richard_Steele said:
The question is what minimization does and why we should apply it.

Since no line will fit the data exactly, when can we say that one fit is better, or more accurate on average, than another? People have devised several measures of error, and the oldest one (which has been around for centuries) is the so-called squared-error measure. If you have a data set ##(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)## and plot them (as a set of points in xy-space), you might get the feeling that the points should lie on or near a straight line of the form ##y = a + bx##. If somebody tells you values of the intercept ##a## and the slope ##b##, you can calculate the "fitted" values of y, which are ##a + b x_i## for ##i = 1, 2, \ldots, n##. In general, some of the fitted values will not agree with the experimental (or observed) values of ##y## at the same ##x##-points, so we will have some errors ##e_i = a + b x_i - y_i## for ##i = 1, 2, \ldots, n##.

We would like to measure the "goodness" of the fit by forming some kind of aggregate error measure that depends on all the values of ##e_i##. The (total) squared-error criterion is ##S_2 = e_1^2 + e_2 ^2 + \cdots + e_n^2##. Because the errors are squared, they contribute the same whether they are > 0 or < 0, and again because they are squared, large error values make more of a contribution to the total than do small errors. For example, if ##e_2 = 10 e_1## then ##e_2^2 = 100 e_1^2##, so in a sense, ##e_2## is a hundred times more crucial or important than is ##e_1##.

Anyway, regardless of the justification, the measure ##S_2## as above is one of the standard measures used in making data fits; it is the oldest historically and the easiest to use.

To find a "best" fit in the total squared error sense, we would like to find parameters ##a## and ##b## that make the value of ##S_2## as small as possible. That is, we want to minimize ##S_2##. Because the errors are squared, when we minimize ##S_2## we are trying hard to avoid really large errors.

In recent times, alternative error measures have been proposed, and they sometimes have better properties than ##S_2##. For example, the least-total deviation fit determines ##a## and ##b## so as to minimize the total absolute error ##S_1 = |e_1| + |e_2| + \cdots + |e_n|##. Determining the best ##a,b## values in this case is more involved than in the squared-error case, and is best attacked using the relatively modern tools of linear programming (invented in the late 1940s or early 1950s, essentially). Minimizing ##S_1## can lead to fits that are more tolerant of "outliers", so one or two individual "bad" data points (that can throw off the least-squares fit badly) are de-emphasized in the least-deviation fit; it is almost as though the least-deviation method is smart enough to ignore really, really bad points.

Anyway, those are the kinds of things we aim to minimize, and we do so to try to make the fit as accurate as possible, knowing that 100% accuracy is impossible.
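As a concrete illustration (made-up numbers, just a sketch), here is how ##S_2## and ##S_1## could be computed in Python for a small dataset and a candidate line:

# Toy data, invented for illustration; the last point plays the role of an outlier
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.4, 2.2, 2.4, 4.0]

def errors(a, b):
    # residuals e_i = (a + b*x_i) - y_i for the candidate line y = a + b x
    return [a + b * x - y for x, y in zip(xs, ys)]

def S2(a, b):
    # total squared error
    return sum(e ** 2 for e in errors(a, b))

def S1(a, b):
    # total absolute error
    return sum(abs(e) for e in errors(a, b))

print(S2(1.0, 0.5), S1(1.0, 0.5))   # about 1.07 and 1.5
print(S2(1.0, 0.6), S1(1.0, 0.6))   # about 0.57 and 1.3 (a steeper candidate line)

Whichever candidate line makes the chosen measure smallest is, by that criterion, the better fit.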
 
  • Like
Likes Richard_Steele
  • #22
Ray Vickson said:
Since no line will fit the data exactly, when can we say that one fit is better, or more accurate on average, than another? [...] Anyway, those are the kinds of things we aim to minimize, and we do so to try to make the fit as accurate as possible, knowing that 100% accuracy is impossible.
Just reading your post. It will take time to digest all the information. I will ask you as soon as I finish reading everything.
 
  • #23
Ray Vickson

So let's start little by little... let's talk about the first paragraph.
Let's say we have a dataset ##(X_1,Y_1), \ldots, (X_n, Y_n)##. As you said, it is possible that those points in the xy-plane cluster around a line. So our task is to find a straight line that is like the 'mean' of those points.

I am going to try to explain it better.
2qwg0sl.jpg

Imagine that we have that cloud of points, so we have to calculate ##y = a + bx## (the straight line). What happens when calculating the straight line? The problem here is that we have to calculate ##a## and ##b## values that give a straight line with the smallest amount of error.

But what is an error?
We know that it is usually impossible to get a straight line that represents all the ##(x, y)## points in the graph with 100% accuracy.
A "perfect" case would be this one:
2jeskuc.jpg

It's perfect (without errors), because all the points lie on the straight line. So any prediction we make, given an ##x## value, will output the exact ##Y## value, without error.
The problem starts when we have some "errors", like this one:
v4pdts.jpg

The green points (I have used green to make them easier to distinguish) do not lie on the straight line. It is impossible to draw one straight line that passes through all the points on the graph. So, when we try to predict ##Y## values, the output won't be 100% accurate because it will contain some error.
An error looks like this:
wkh3rs.jpg

The predicted value doesn't match the real value: there is an error between the real value and the predicted value.
If we analyze a huge number of points, there will be a huge number of errors, right?

So our task is to calculate the straight line with the smallest total error, right? For that reason it's called "minimization", because we are trying to minimize the total amount of error, right?
 
  • #24
Richard_Steele said:
So our task is to calculate the straight line with the smallest total error, right? For that reason it's called "minimization", because we are trying to minimize the total amount of error, right?

Yes, but you need to specify how you measure the "total amount of errors". As I said, the classical method is to take the sum of the squares of all the individual errors.

The nice thing about that error measure is that its solution is easy, involving simple formulas for ##a## and ##b## that can be implemented readily on a hand-held calculator if the number of data points is moderate (say no more than about 20). It can be done easily in a spreadsheet even if the number of data points is in the thousands. (In fact, most spreadsheets contain a "least-squares fit" routine that can set up and solve the problem for you more-or-less automatically.)
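For reference, the "simple formulas" referred to here are the standard least-squares estimates (a well-known result, quoted without derivation): with ##\bar{x}## and ##\bar{y}## the means of the ##x_i## and ##y_i##,
$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}.$$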
 
  • Like
Likes Richard_Steele
  • #25
Ray Vickson said:
Yes, but you need to specify how you measure the "total amount of errors". As I said, the classical method is to take the sum of the squares of all the individual errors.

The nice thing about that error measure is that its solution is easy, involving simple formulas for ##a## and ##b## that can be implemented readily on a hand-held calculator if the number of data points is moderate (say no more than about 20). It can be done easily in a spreadsheet even if the number of data points is in the thousands. (In fact, most spreadsheets contain a "least-squares fit" routine that can set up and solve the problem for you more-or-less automatically.)
Let's start with: how is a single error measured?
I think it is:
the real Y value minus the predicted Y value. This gives you a number that is the error, right?
 
  • #26
Richard_Steele said:
Let's start with: how is a single error measured?
I think it is:
the real Y value minus the predicted Y value. This gives you a number that is the error, right?

More-or-less. There is no general agreement on whether the error ##e## is ##e = y_{\text{true}}-y_{\text{predicted}}## or the opposite, ##e =y_{\text{predicted}} -y_{\text{true}}##. Just make sure that you use the same convention at each and every data point.
 
Last edited:
  • Like
Likes Richard_Steele
  • #27
Ray Vickson said:
More-or-less. There is no general agreement on whether the error ##e## is ##e = y_{\text{true}}-y_{\text{predicted}}## or the opposite, ##e =y_{\text{predicted}} -y_{\text{true}}##.
Of course, I understand. I had read something about the 'absolute value', so I previously thought the order of the subtraction doesn't matter (as you explain in your post). And now I better understand why it's not really important whether you use ##Y_{\text{predicted}} - Y_{\text{real}}## or ##Y_{\text{real}} - Y_{\text{predicted}}##: you just need to always use one or the other (only one, consistently). Afterwards, when you square the results to measure the error, the negative numbers become positive (a subtraction may yield a negative number as the error).

So, once we have calculated a single error, we need to calculate all the errors. As you explained, we could try different values for ##a## and ##b##. The question now is: when we want to minimize ##S_2##, is there a formula to calculate the best ##a## and ##b## parameters directly, or do we need to try random values for ##a## and ##b## and select the best ones?
 
  • #28
Richard_Steele said:
The question now is: when we want to minimize ##S_2##, is there a formula to calculate the best ##a## and ##b## parameters directly, or do we need to try random values for ##a## and ##b## and select the best ones?

No need for guesswork or trial-and-error; this problem was solved about 200 years ago. Google "least-squares line" or "regression line".
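If you have Python with NumPy at hand, one ready-made routine (a sketch with invented data; spreadsheets and calculators offer equivalents) is numpy.polyfit:

import numpy as np

# Invented training data: x = house size, y = price (in thousands)
x = np.array([50.0, 70.0, 80.0, 100.0, 120.0])
y = np.array([150.0, 200.0, 210.0, 260.0, 310.0])

slope, intercept = np.polyfit(x, y, 1)   # degree-1 (straight line) least-squares fit
print("theta0 (intercept):", intercept)
print("theta1 (slope):", slope)
print("prediction for x = 90:", intercept + slope * 90.0)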
 
  • Like
Likes Richard_Steele
  • #29
Ray Vickson said:
No need for guesswork or trial-and-error; this problem was solved about 200 years ago. Google "least-squares line" or "regression line".
I am going to Google them. It will take time.
 

What is a cost function and why is it important in data analysis?

A cost function is a mathematical function that measures the error between the predicted values and the actual values in a dataset. It is used to evaluate the performance of a machine learning model and to optimize its parameters. It is important in data analysis because it helps in understanding the relationship between the input variables and the output variable and aids in making better predictions.

How is the cost function calculated?

The cost function is calculated by taking the average of the squared differences between the predicted values and the actual values. This is known as the Mean Squared Error (MSE) and is given by ##\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{actual},i}\right)^2##, where ##n## is the number of data points.
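As a sketch, the same calculation in Python with invented numbers:

# Mean Squared Error for a handful of illustrative values
y_pred   = [2.5, 0.0, 2.1, 7.8]
y_actual = [3.0, -0.5, 2.0, 7.5]

n = len(y_actual)
mse = sum((p - a) ** 2 for p, a in zip(y_pred, y_actual)) / n
print(mse)   # (0.25 + 0.25 + 0.01 + 0.09) / 4 = about 0.15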

What is the significance of the minimum point in a cost function?

The minimum point in a cost function represents the point at which the model has the least amount of error and is performing at its best. This is the point that the model strives to reach during the training process, as it indicates that the parameters are optimal and the predictions are as accurate as possible.

How do we interpret the slope of a cost function?

The slope of a cost function represents the rate of change of the cost or error with respect to the model parameters. A steeper slope indicates a larger change in the cost, while a flatter slope indicates a smaller change. Ideally, we want the slope to be close to zero at the minimum point, indicating that any small changes in the parameters will not result in a significant change in the cost.
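To make this concrete for the straight-line model used in this thread: with ##h(x) = \theta_0 + \theta_1 x## and the MSE cost ##J(\theta_0, \theta_1) = \frac{1}{n}\sum_{i=1}^{n}\left(h(x_i) - y_i\right)^2##, those slopes are the partial derivatives
$$\frac{\partial J}{\partial \theta_0} = \frac{2}{n}\sum_{i=1}^{n}\left(h(x_i) - y_i\right), \qquad \frac{\partial J}{\partial \theta_1} = \frac{2}{n}\sum_{i=1}^{n}\left(h(x_i) - y_i\right)x_i,$$
and both are zero at the minimum. (Some courses define the cost with a factor 1/(2n) instead of 1/n, which only rescales these expressions.)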

What are some common techniques for interpreting a cost function?

Some common techniques for interpreting a cost function include plotting the function to visually identify the minimum point, calculating the gradient at different points to determine the direction of change, and using optimization algorithms such as gradient descent to find the optimal parameters that minimize the cost function. Additionally, comparing the cost function of different models can also provide insights into which model is performing better.
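A minimal gradient-descent sketch in Python (invented data, hand-picked learning rate and iteration count, only meant to show the idea):

# Gradient descent on the MSE cost for h(x) = theta0 + theta1 * x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.4, 2.2, 2.4, 4.0]
n = len(xs)

theta0, theta1 = 0.0, 0.0
alpha = 0.05   # learning rate, chosen by hand for this small example

for _ in range(5000):
    errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = (2.0 / n) * sum(errs)                               # dJ/dtheta0
    grad1 = (2.0 / n) * sum(e * x for e, x in zip(errs, xs))    # dJ/dtheta1
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)   # approaches the least-squares values (about 0.86 and 0.68 here)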
