# Homework Help: How to interpret Cost Function?

1. Oct 9, 2016

### Richard_Steele

I am just starting a course about machine learning and I don't know how to interpret the cost function.

When the teacher draws the straight line in the x and y coordinates, it looks like:

I see that theta zero is the start of the straight line (in the left side) in the Y coordinate.
My question is: what does theta one modify? Does it change the inclination of the line?

2. Oct 9, 2016

### Krylov

Yes, also called the slope.

However, how this is to be interpreted in a machine learning context is not clear to me, as I never really understood what that field is about. EDIT: Maybe somebody such as @StatGuy2000 can help you with the interpretation?

Last edited: Oct 9, 2016
3. Oct 9, 2016

### Richard_Steele

Of course, so Theta_1 modifies the slope.
But I don't understand what scale the slope is measured on. In the examples (the middle and right graphs I posted in post #1), Theta_1 = 0.5.
What kind of unit is 0.5? Degrees? I don't know how to relate 0.5 to the slope of the straight line.

Responding to your question about machine learning: this algorithm is used in linear regression. You give a dataset to your algorithm (for example, x = size of a house and y = price of the house), and it has to calculate the straight line that best fits your dataset. Then you give the algorithm values of X (size of the house) and the software calculates the value of Y (price of the house). This is used in supervised learning. Supervised means that you give the correct answers to the software so that it can learn from the data. Unsupervised learning means that you give data to the software, but not classified data (no correct answers are given in the data).

4. Oct 9, 2016

### Staff: Mentor

I moved the thread to our homework section, as it is homework-like.

It does not have units. It means if x (the horizontal coordinate) increases by 1, then h (the vertical coordinate) increases by 0.5. You can see that if you check the function definition and increase x by 1.
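A quick numerical check of that statement, as a minimal sketch. It assumes the hypothesis has the form h(x) = theta_0 + theta_1 * x, and uses theta_0 = 1 and theta_1 = 0.5 as example values; the function name `h` and the sample x values are only illustrative.

```python
def h(x, theta_0, theta_1):
    """Straight-line hypothesis: intercept theta_0, slope theta_1."""
    return theta_0 + theta_1 * x

theta_0, theta_1 = 1.0, 0.5  # example values (assumed for illustration)

# Increase x by 1 at several starting points and record how much h changes.
steps = [h(x + 1, theta_0, theta_1) - h(x, theta_0, theta_1) for x in range(5)]
print(steps)  # every entry equals theta_1
```

Whatever the starting x, the change in h is theta_1, which is what "slope" means here.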

5. Oct 9, 2016

### Ray Vickson

Just look at the three diagrams. What do you see when you compare the first two? To anchor your understanding, try the following little exercise for yourself: plot the line y=1+x, which uses $\theta_0=1, \theta_1=1$. Compare that new line with the third line plotted above. What do you see?

6. Oct 9, 2016

### Richard_Steele

I see that the straight line crosses the Y axis at 1 (Y = 1). As X increases in steps of 1 unit, Y = X + 1, so Y is always one unit higher than X.
Right?

7. Oct 9, 2016

### Ray Vickson

Yes, but I asked you to compare the new line with the third line given in post #1. Take a sheet of graph paper; plot both lines on the same sheet. Now tell me what you see.

8. Oct 9, 2016

9. Oct 9, 2016

### Richard_Steele

OK, I will post the comparisons here in a few minutes.

10. Oct 9, 2016

### Richard_Steele

I see a difference in the slope. The line 1 + 1X is steeper than 1 + 0.5X.

11. Oct 9, 2016

### Ray Vickson

Exactly. When x increases by 1, 1+x increases by 1 but 1 + .5*x increases by 1/2.

It would have been more revealing if you had (as I suggested) plotted both lines on the same sheet of paper. If the software you are using does not allow that, then do it by hand on an actual, physical sheet of paper. Alternatively, you can use the graphing packages in a typical spreadsheet to plot several lines or curves on the same plot.
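If no plotting tool is at hand, a side-by-side table of values gives much the same comparison. This is only a sketch: the helper name `line` and the chosen x values are illustrative, not something from the course.

```python
def line(theta_0, theta_1):
    """Return the function x -> theta_0 + theta_1 * x."""
    return lambda x: theta_0 + theta_1 * x

steep = line(1.0, 1.0)    # y = 1 + x
shallow = line(1.0, 0.5)  # y = 1 + 0.5x

xs = [0, 1, 2, 3, 4]
rows = [(x, steep(x), shallow(x)) for x in xs]

# Print both lines against the same x values.
print(f"{'x':>3} {'1 + x':>7} {'1 + 0.5x':>9}")
for x, y1, y2 in rows:
    print(f"{x:>3} {y1:>7.1f} {y2:>9.1f}")
```

Both lines start at the same height when x = 0 (same intercept), and the gap between them grows by 0.5 for each unit step in x (different slopes).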

12. Oct 9, 2016

### Richard_Steele

Graphs plotted on a sheet of paper.

13. Oct 9, 2016

### Ray Vickson

Good; that really does show up the difference most dramatically.

14. Oct 9, 2016

### Richard_Steele

Yes, it's clearer when I draw both on the same graph.

I am reading about 'minimizing the cost function'. What does minimizing mean, and why is minimization used?

15. Oct 9, 2016

### Staff: Mentor

Your graphs ought to show the equations; that is, y = 1 + x and y = 1 + 0.5x.

Also, the axes are usually labelled on the positive ends. You have your labels for the x-axis on the negative end.

Any business that manufactures and sells a product is always interested in maximizing its profit. One way to do this is to minimize (make as small as possible) its costs.

16. Oct 9, 2016

### Richard_Steele

I am learning about the cost function as applied to machine learning. I am using it in linear regression. So I don't know if minimization has the same objective in manufacturing and in machine learning.

17. Oct 9, 2016

### Staff: Mentor

I don't know how it's related to machine learning, but maybe how long it takes for a program to learn something? Maybe that's what "cost" means in this situation.

18. Oct 9, 2016

### Richard_Steele

In the video, the teacher shows a Cartesian plane. The horizontal axis, X, is the size of the house. The vertical axis, Y, is the price of the house. That dataset is called 'the training set' (it contains the correct answers).

Then, with the training set, the program has to calculate the straight line I was asking about in post #1 of this thread. After that, it's necessary to 'minimize the function'. It's something like calculating the best values for theta zero and theta one, to produce a straight line that minimizes the error between the Y values (those from the training dataset) and h(x) (the hypothesis, the predicted Y value). This predicted Y value is called the prediction or the hypothesis. The only real Y values come from the real dataset (the training dataset).

The question is what minimization does and why we should apply it.

19. Oct 9, 2016

### EnumaElish

The slope parameter is measured in Y units/X units. In the equation Y = a + b X, b = ∂Y/∂X. If Y is "meters traveled" and X is "seconds of time" then b corresponds to velocity measured in meters per second. (In an estimation context b would be called average velocity or average incremental distance.) If Y is dollars and X is square feet then b is the increase in dollars when area increases 1 sqft. In this case b is measured in dollars per sqft.
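The dollars-per-sqft case can be worked through with two points. The numbers below are hypothetical and chosen only to show how the units combine.

```python
# Hypothetical numbers, only to illustrate the units of the slope.
area_sqft = (1000.0, 1500.0)            # X values, in square feet
price_dollars = (200_000.0, 275_000.0)  # Y values, in dollars

# Slope = change in Y / change in X, so its unit is dollars per square foot.
b = (price_dollars[1] - price_dollars[0]) / (area_sqft[1] - area_sqft[0])
print(f"b = {b} dollars per sqft")
```

Here the price rises by 75,000 dollars over 500 sqft, so b is 150 dollars per sqft.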

Last edited: Oct 10, 2016
20. Oct 9, 2016

### Staff: Mentor

Where Y is a function of one variable, X, the slope would be dY/dX. Of course, in this case, the partial you wrote would be the same as the derivative I wrote. However, as this thread is in the Precalc section, the OP might not be familiar with derivatives of any kind.

21. Oct 9, 2016

### Ray Vickson

Since no line will fit the data exactly, when can we say that one fit is better, or more accurate than another on average? People have devised several measures of error, and the oldest one (which has been around for centuries) is the so-called squared-error measure. If you have a data set $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ and plot them (as a set of points in xy-space), you might get the feeling that the points should lie on or near a straight line of the form $y = a + bx$. If somebody tells you values of the intercept $a$ and the slope $b$, you can calculate the "fitted" values of y, which are $a + b x_i$ for $i = 1,2, \ldots, n$. In general, some of the fitted values will not agree with the experimental (or observed) values of $y$ at the same $x$-points, so we will have some errors $e_i = a + b x_i - y_i$ for $i = 1,2, \ldots, n$.
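As a small illustration of fitted values and errors, here is a sketch with a hypothetical data set and a hypothetical candidate line (neither comes from the course material).

```python
# Hypothetical data set (x_i, y_i) and a hypothetical candidate line y = a + b*x.
data = [(1.0, 2.0), (2.0, 2.5), (3.0, 4.0), (4.0, 4.5)]
a, b = 1.0, 1.0

# Fitted values a + b*x_i, and errors e_i = a + b*x_i - y_i.
fitted = [a + b * x for x, _ in data]
errors = [a + b * x - y for x, y in data]

for (x, y), f, e in zip(data, fitted, errors):
    print(f"x={x}  observed y={y}  fitted={f}  error={e}")
```

With this candidate line, two points are hit exactly (error 0) and two are overshot by 0.5.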

We would like to measure the "goodness" of the fit by forming some kind of aggregate error measure that depends on all the values of $e_i$. The (total) squared-error criterion is $S_2 = e_1^2 + e_2 ^2 + \cdots + e_n^2$. Because the errors are squared, they contribute the same whether they are > 0 or < 0, and again because they are squared, large error values make more of a contribution to the total than do small errors. For example, if $e_2 = 10 e_1$ then $e_2^2 = 100 e_1^2$, so in a sense, $e_2$ is a hundred times more crucial or important than is $e_1$.
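Both properties of the squared-error total can be checked directly. The error values below are made up for the illustration, and the function name `total_squared_error` is just a label for $S_2$.

```python
def total_squared_error(errors):
    """S_2 = e_1^2 + e_2^2 + ... + e_n^2."""
    return sum(e * e for e in errors)

# Flipping the sign of every error leaves S_2 unchanged.
s2_original = total_squared_error([1.0, -1.0, 2.0])
s2_flipped = total_squared_error([-1.0, 1.0, -2.0])
print(s2_original, s2_flipped)

# An error 10 times larger contributes 100 times as much.
e1 = 0.5
e2 = 10 * e1
ratio = e2 ** 2 / e1 ** 2
print(ratio)
```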

Anyway, regardless of the justification, the measure $S_2$ as above is one of the standard measures used in making data fits; it is the oldest historically and the easiest to use.

To find a "best" fit in the total squared error sense, we would like to find parameters $a$ and $b$ that make the value of $S_2$ as small as possible. That is, we want to minimize $S_2$. Because the errors are squared, when we minimize $S_2$ we are trying hard to avoid really large errors.
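To make "minimize" concrete, one can compute $S_2$ for a handful of candidate lines and keep the one with the smallest value. This is only a brute-force sketch with a hypothetical data set and hand-picked candidates; it compares a few guesses and is not how the fit is computed in practice.

```python
# Hypothetical data set (x_i, y_i).
data = [(1.0, 2.0), (2.0, 2.5), (3.0, 4.0), (4.0, 4.5)]

def s2(a, b, data):
    """Total squared error of the line y = a + b*x on the data."""
    return sum((a + b * x - y) ** 2 for x, y in data)

# Hand-picked candidate (a, b) pairs, chosen only for illustration.
candidates = [(1.0, 1.0), (1.0, 0.5), (1.0, 0.9), (0.0, 1.0)]

for a, b in candidates:
    print(f"a={a}, b={b}: S_2 = {s2(a, b, data):.2f}")

# "Minimizing" means picking the candidate with the smallest S_2.
best = min(candidates, key=lambda ab: s2(ab[0], ab[1], data))
print("best candidate:", best)
```

Among these four guesses, the line with a = 1 and b = 0.9 has the smallest total squared error.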

In recent times, alternative error measures have been proposed, and they sometimes have better properties than $S_2$. For example, the least-total deviation fit determines $a$ and $b$ so as to minimize the total absolute error $S_1 = |e_1| + |e_2| + \cdots + |e_n|$. Determining the best $a,b$ values in this case is more involved than in the squared-error case, and is best attacked using the relatively modern tools of linear programming (invented in the late 1940s or early 1950s, essentially). Minimizing $S_1$ can lead to fits that are more tolerant of "outliers", so one or two individual "bad" data points (that can throw off the least-squares fit badly) are de-emphasized in the least-deviation fit; it is almost as though the least-deviation method is smart enough to ignore really, really bad points.
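The effect of an outlier on the two measures can be seen without solving either minimization problem, by scoring two fixed lines. The data below are hypothetical: four points lie exactly on y = x and the fifth is a deliberate outlier. Line A follows the four good points; line B is the least-squares line for all five points.

```python
# Hypothetical data: four points on y = x, plus one outlier at x = 4.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 14.0)]

def s1(a, b, data):
    """Total absolute error of the line y = a + b*x."""
    return sum(abs(a + b * x - y) for x, y in data)

def s2(a, b, data):
    """Total squared error of the line y = a + b*x."""
    return sum((a + b * x - y) ** 2 for x, y in data)

line_a = (0.0, 1.0)   # y = x, ignores the outlier
line_b = (-2.0, 3.0)  # y = -2 + 3x, tilted toward the outlier

for name, (a, b) in (("A", line_a), ("B", line_b)):
    print(f"line {name}: S_1 = {s1(a, b, data)}, S_2 = {s2(a, b, data)}")
```

Measured by $S_1$, line A scores better (10 against 12); measured by $S_2$, line B scores better (40 against 100). The squared measure is willing to miss three good points in order to reduce the single large error.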

Anyway, those are the kinds of things we aim to minimize, and we do so to try to make the fit as accurate as possible, knowing that 100% accuracy is impossible.

22. Oct 10, 2016

### Richard_Steele

Just reading your post. It will take time to digest all the information. I will ask you as soon as I finish reading everything.

23. Oct 15, 2016

### Richard_Steele

Ray Vickson

So let's start little by little... let's talk about the first paragraph.
Let's suppose we have a dataset $(X_1,Y_1), \ldots, (X_n, Y_n)$. As you said, it is possible that those points in the xy space form groups. So our task is to find a straight line that is like the 'mean' of those points.

I am going to try to explain it better.

Imagine that we have that cloud of points, so we have to calculate $y = a + bx$ (the straight line). What happens while calculating the straight line? The problem here is that we have to calculate the values of $a$ and $b$ that give the straight line with the smallest amount of error.

But, what is an error?
We know that it is impossible to get a straight line that represents all the $x,y$ points in the graph with 100% accuracy.
A "perfect" case would be this one:

It's perfect (without errors), because all the points lie on the straight line. So any prediction that we make, given an $x_n$ value, will output an exact $Y$ value, without errors.
The problem starts when we have some "errors", like this one:

The green points (I have used green to make them easier to tell apart) do not lie on the straight line. Using linear algebra it's impossible to create a straight line that contains all the points of the graph. So, when we try to predict $Y$ values, the output won't be 100% accurate because it will contain some percentage of error.
An error looks like this:

It is an error because the predicted value doesn't match the real value: there is a gap between the real value and the predicted value.
If we analyze a huge number of points, there will be a huge number of errors, right?

So our task is to calculate the straight line that has the smallest amount of error, right? That is why it's called "minimization": we are trying to minimize the total amount of error, right?

24. Oct 15, 2016

### Ray Vickson

Yes, but you need to specify how you measure the "total amount of errors". As I said, the classical method is to take the sum of the squares of all the individual errors.

The nice thing about that error measure is that its solution is easy, involving simple formulas for $a$ and $b$ that can be implemented readily on a hand-held calculator if the number of data points is moderate (say no more than about 20). It can be done easily in a spreadsheet even if the number of data points is in the thousands. (In fact, most spreadsheets contain a "least-squares fit" routine that can set up and solve the problem for you more-or-less automatically.)
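For reference, here is a sketch of the standard textbook least-squares formulas, which are presumably the "simple formulas" meant here: $b = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $a = \bar{y} - b\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the means of the data. The function name `least_squares_line` and the data set are hypothetical.

```python
def least_squares_line(data):
    """Return (a, b) minimizing the total squared error of y = a + b*x."""
    n = len(data)
    x_mean = sum(x for x, _ in data) / n
    y_mean = sum(y for _, y in data) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in data)
    s_xx = sum((x - x_mean) ** 2 for x, _ in data)
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    return a, b

# Hypothetical data set (x_i, y_i).
data = [(1.0, 2.0), (2.0, 2.5), (3.0, 4.0), (4.0, 4.5)]
a, b = least_squares_line(data)
print(f"best-fit line: y = {a:.2f} + {b:.2f} x")
```

For this small data set the result is approximately a = 1 and b = 0.9. A spreadsheet's built-in fit routine should give the same line.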

25. Oct 15, 2016