MHB Optimizing Linear Regression Cost Function

Summary
The discussion revolves around optimizing a linear regression cost function and locating a mistake in the differentiation. The poster initially misreads the notation, taking the inner summation to be over column vectors rather than over the elements of a single column vector. The replies clarify that each squared term should be read as $x^Tx$ and differentiated with the product rule, and they highlight the importance of understanding matrix operations and notation in machine learning contexts. Ultimately, the poster recognizes the misreading and asks how to apply the product rule from there.
Dethrone
I'm trying to optimize the function below, but I'm not sure where I made a mistake. (This is an application in machine learning.)
$$J(\theta)=\sum_{i=1}^n \left(\sum_{j=1}^{k}(\theta^Tx^{(i)}-y^{(i)})_j^2\right)$$
where $\theta$ is an $n$ by $k$ matrix, each $x^{(i)}$ is an $n$ by 1 column vector, and each $y^{(i)}$ is a $k$ by 1 column vector.

$$=\sum_{i=1}^n \left((\theta^Tx^{(i)}-y^{(i)})_1^2+\dots+(\theta^Tx^{(i)}-y^{(i)})_k^2 \right)$$

Differentiating,
$$\frac{\partial J}{\partial \theta_{pq}}=2\sum_{i=1}^n \left( (\theta^Tx^{(i)}-y^{(i)})_1\frac{\partial}{\partial \theta_{pq}}(\theta^Tx^{(i)})_1+\dots+(\theta^Tx^{(i)}-y^{(i)})_k\frac{\partial}{\partial \theta_{pq}}(\theta^Tx^{(i)})_k\right)$$
But, if we look at the first term, $(\theta^Tx^{(i)}-y^{(i)})_1$ is a $k$ by 1 vector and $\frac{\partial}{\partial \theta_{pq}}(\theta^Tx^{(i)})$ is also a $k$ by 1 vector, so we can't multiply them together...(maybe unless we use tensors...). Where did I make a mistake?
 
Never mind. It turns out I misunderstood my prof's notation.
$\left(\sum_{j=1}^{k}(\theta^Tx^{(i)}-y^{(i)})_j^2\right)$ is apparently summing up the elements of the column vector, not a sum of column vectors. Not sure if this is standard notation, but it wasn't apparent to me.

But, even if I interpret it the way I did in the original post, where is the mistake I made? I'm curious.
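For anyone reading along, here is a minimal NumPy sketch of that element-sum reading of the cost. The variable names are mine, and I'm using $m$ for the number of training examples (the formula above writes the example index as running from 1 to $n$, which is also the feature dimension):

Python:
import numpy as np

# Sketch of J(theta) = sum_i sum_j (theta^T x^(i) - y^(i))_j^2
# under the element-sum reading: m examples, n features, k outputs.
rng = np.random.default_rng(0)
m, n, k = 5, 3, 2
X = rng.normal(size=(m, n))      # row i is x^(i), an n-vector
Y = rng.normal(size=(m, k))      # row i is y^(i), a k-vector
Theta = rng.normal(size=(n, k))  # the n-by-k parameter matrix

def cost(Theta, X, Y):
    R = X @ Theta - Y            # row i is (Theta^T x^(i) - y^(i))^T
    return np.sum(R**2)          # sum squared entries over j, then over i

print(cost(Theta, X, Y))

Summing every squared entry of $X\Theta-Y$ this way is the same as taking the squared Frobenius norm $\|X\Theta-Y\|_F^2$.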
 
Hey Rido12! ;)

When we write $x^2$, aren't we multiplying two $k$ by 1 vectors as well?
However, what is meant is $x^Tx$.
When we differentiate, we should apply the product rule to it.

As for summing over j, we're really summing independent measurements.
They can indeed be organized as k by 1 columns in a large matrix.
Still, that's separate from the 'mistake' you mentioned.
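Concretely, with $u$ any vector depending on a scalar parameter $t$ (generic names, just for illustration), the product rule gives
$$\frac{d}{dt}\left(u^Tu\right)=\Big(\frac{du}{dt}\Big)^Tu+u^T\frac{du}{dt}=2\,u^T\frac{du}{dt},$$
since both terms are scalars and each is the transpose of the other.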
 
I like Serena said:
Hey Rido12! ;)

When we write $x^2$, aren't we multiplying two $k$ by 1 vectors as well?
However, what is meant is $x^Tx$.
When we differentiate, we should apply the product rule to it.

As for summing over j, we're really summing independent measurements.
They can indeed be organized as k by 1 columns in a large matrix.
Still, that's separate from the 'mistake' you mentioned.

Hi I like Serena!

Thanks! Can you clarify which terms you were referring to that required the product rule?

EDIT:

Oh, are you talking about $(\theta^Tx^{(i)}-y^{(i)})_1^2=(\theta^Tx^{(i)}-y^{(i)})_1^T (\theta^Tx^{(i)}-y^{(i)})_1$, and then applying the product rule from there? Now I feel rather silly for forgetting that.
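For the record, here is one way the product rule works out under the element-sum reading. Writing $a_j=(\theta^Tx^{(i)}-y^{(i)})_j$ and using the Kronecker delta $\delta_{jq}$ (both just shorthand introduced here), each squared scalar component is $a_j^2=a_j\,a_j$, and since $(\theta^Tx^{(i)})_j=\sum_{r}\theta_{rj}\,x_r^{(i)}$,
$$\frac{\partial}{\partial \theta_{pq}}(\theta^Tx^{(i)})_j=x_p^{(i)}\,\delta_{jq},$$
so
$$\frac{\partial J}{\partial \theta_{pq}}=2\sum_{i=1}^n\sum_{j=1}^k(\theta^Tx^{(i)}-y^{(i)})_j\,x_p^{(i)}\,\delta_{jq}=2\sum_{i=1}^n(\theta^Tx^{(i)}-y^{(i)})_q\,x_p^{(i)},$$
which collects into the $n$ by $k$ matrix $\nabla_\theta J=2\sum_{i=1}^n x^{(i)}\left(\theta^Tx^{(i)}-y^{(i)}\right)^T$, the same shape as $\theta$.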
 
