MHB Optimizing Linear Regression Cost Function

Summary
The discussion revolves around optimizing a linear regression cost function and locating a mistake in the differentiation. The poster initially misreads the notation, taking the inner summation to be over column vectors rather than over the elements of a single column vector. The replies clarify that each squared term should be read as $x^Tx$ and differentiated with the product rule, and they highlight the importance of understanding matrix operations and notation in machine learning contexts. Ultimately, the poster recognizes the misreading and asks how to apply the product rule from there.
Dethrone
I'm trying to optimize the function below, but I'm not sure where I made a mistake. (This is an application in machine learning.)
$$J(\theta)=\sum_{i=1}^n \left(\sum_{j=1}^{k}(\theta^Tx^{(i)}-y^{(i)})_j^2\right)$$
where $\theta$ is an $n$ by $k$ matrix, each $x^{(i)}$ is an $n$ by 1 column vector, and each $y^{(i)}$ is a $k$ by 1 column vector.

$$=\sum_{i=1}^n \left((\theta^Tx^{(i)}-y^{(i)})_1^2+\dots+(\theta^Tx^{(i)}-y^{(i)})_k^2 \right)$$

Differentiating,
$$\frac{\partial J}{\partial \theta_{pq}}=2\sum_{i=1}^n \left( (\theta^Tx^{(i)}-y^{(i)})_1\frac{\partial}{\partial \theta_{pq}}(\theta^Tx^{(i)})_1+\dots+(\theta^Tx^{(i)}-y^{(i)})_k\frac{\partial}{\partial \theta_{pq}}(\theta^Tx^{(i)})_k\right)$$
But, if we look at the first term, $(\theta^Tx^{(i)}-y^{(i)})_1$ is a $k$ by 1 vector and $\frac{\partial}{\partial \theta_{pq}}(\theta^Tx^{(i)})$ is also a $k$ by 1 vector, so we can't multiply them together...(maybe unless we use tensors...). Where did I make a mistake?
 
Never mind. It turns out I misunderstood my prof's notation.
$\left(\sum_{j=1}^{k}(\theta^Tx^{(i)}-y^{(i)})_j^2\right)$ is apparently summing up the elements of the column vector, not a sum of column vectors. Not sure if this is standard notation, but it wasn't apparent to me.

But, even if I interpret it the way I did in the original post, where is the mistake I made? I'm curious.
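For anyone reading along, here is a minimal NumPy sketch of that element-sum reading of the cost. The variable names are mine, and I'm using $m$ for the number of training examples (the formula above writes the example index as running from 1 to $n$, which is also the feature dimension):

Python:
import numpy as np

# Sketch of J(theta) = sum_i sum_j (theta^T x^(i) - y^(i))_j^2
# under the element-sum reading: m examples, n features, k outputs.
rng = np.random.default_rng(0)
m, n, k = 5, 3, 2
X = rng.normal(size=(m, n))      # row i is x^(i), an n-vector
Y = rng.normal(size=(m, k))      # row i is y^(i), a k-vector
Theta = rng.normal(size=(n, k))  # the n-by-k parameter matrix

def cost(Theta, X, Y):
    R = X @ Theta - Y            # row i is (Theta^T x^(i) - y^(i))^T
    return np.sum(R**2)          # sum squared entries over j, then over i

print(cost(Theta, X, Y))

Summing every squared entry of $X\Theta-Y$ this way is the same as taking the squared Frobenius norm $\|X\Theta-Y\|_F^2$.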
 
Hey Rido12! ;)

When we write $x^2$, aren't we multiplying two $k$ by 1 vectors as well?
However, what is meant is $x^Tx$.
When we differentiate, we should apply the product rule to it.

As for summing over j, we're really summing independent measurements.
They can indeed be organized as k by 1 columns in a large matrix.
Still, that's separate from the 'mistake' you mentioned.
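Concretely, with $u$ any vector depending on a scalar parameter $t$ (generic names, just for illustration), the product rule gives
$$\frac{d}{dt}\left(u^Tu\right)=\Big(\frac{du}{dt}\Big)^Tu+u^T\frac{du}{dt}=2\,u^T\frac{du}{dt},$$
since both terms are scalars and each is the transpose of the other.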
 
I like Serena said:
Hey Rido12! ;)

When we write $x^2$, aren't we multiplying two $k$ by 1 vectors as well?
However, what is meant is $x^Tx$.
When we differentiate, we should apply the product rule to it.

As for summing over j, we're really summing independent measurements.
They can indeed be organized as k by 1 columns in a large matrix.
Still, that's separate from the 'mistake' you mentioned.

Hi I like Serena!

Thanks! Can you clarify which terms you were referring to that required the product rule?

EDIT:

Oh, are you talking about $(\theta^Tx^{(i)}-y^{(i)})_1^2=(\theta^Tx^{(i)}-y^{(i)})_1^T (\theta^Tx^{(i)}-y^{(i)})_1$, and then applying the product rule from there? Now I feel rather silly for forgetting that.
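For the record, here is one way the product rule works out under the element-sum reading. Writing $a_j=(\theta^Tx^{(i)}-y^{(i)})_j$ and using the Kronecker delta $\delta_{jq}$ (both just shorthand introduced here), each squared scalar component is $a_j^2=a_j\,a_j$, and since $(\theta^Tx^{(i)})_j=\sum_{r}\theta_{rj}\,x_r^{(i)}$,
$$\frac{\partial}{\partial \theta_{pq}}(\theta^Tx^{(i)})_j=x_p^{(i)}\,\delta_{jq},$$
so
$$\frac{\partial J}{\partial \theta_{pq}}=2\sum_{i=1}^n\sum_{j=1}^k(\theta^Tx^{(i)}-y^{(i)})_j\,x_p^{(i)}\,\delta_{jq}=2\sum_{i=1}^n(\theta^Tx^{(i)}-y^{(i)})_q\,x_p^{(i)},$$
which collects into the $n$ by $k$ matrix $\nabla_\theta J=2\sum_{i=1}^n x^{(i)}\left(\theta^Tx^{(i)}-y^{(i)}\right)^T$, the same shape as $\theta$.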
 
