Sparsity of Support Vector Machines over an RKHS

In summary: if, for a given data point $x_{i}$, both one-sided derivatives of the loss $L(y_{i},f(x_{i}))$ with respect to $f(x_{i})$ vanish at the minimizer, then the corresponding coefficient $\alpha_{i}$ in the kernel expansion can be taken to be zero. The solution therefore depends only on the "support vectors", i.e. the points $x_{i}$ whose coefficients $\alpha_{i}$ do not vanish. This is a useful result, since it explains the sparsity of solutions to optimization problems such as the SVM problem.
  • #1
eipiplusone
I'm trying to solve the following problem from the book 'Learning with Kernels' and would really appreciate a little help.

Background information

- Let $\{(x_{1},y_{1}),\dots,(x_{N},y_{N})\}$ be a dataset, $L$ a loss function, and $H(k)$ a reproducing kernel Hilbert space with kernel $k$. The representer theorem states that the minimizer $f \in H(k)$ of the optimization problem
\begin{equation}
\operatorname*{argmin}_{f \in H(k)} \sum_{i=1}^{N}L(y_{i},f(x_{i})) + \Vert f \Vert_{H(k)}
\end{equation}
can be represented as
\begin{equation}
f(\cdot) = \sum_{i=1}^{N}\alpha_{i}k(x_{i},\cdot)
\end{equation}

Problem statement

- Show that a sufficient condition for a coefficient $\alpha_{i}$ of the kernel expansion to vanish is that, for the corresponding loss term $L(y_{i},f(x_{i}))$, both the left-hand and the right-hand derivative with respect to $f(x_{i})$ vanish. Hint: use the proof strategy of the representer theorem (see https://en.wikipedia.org/wiki/Representer_theorem). A concrete example of such a loss is sketched below.
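
For concreteness (this example is mine, not part of the book's problem), the $\varepsilon$-insensitive loss used for support vector regression has exactly this flatness property:
\begin{equation}
L(y, f(x)) = \max\{0,\ |y - f(x)| - \varepsilon\},
\end{equation}
so for any point with $|y_{i} - f(x_{i})| < \varepsilon$ the loss is identically zero in a neighbourhood of $f(x_{i})$, and both one-sided derivatives with respect to $f(x_{i})$ vanish.
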
My sparse thoughts

- I have been pondering this for half a day but don't seem to get anywhere. I don't exactly know how to make sense of the derivative of $L$. Since we minimize over $H(k)$ (a set of functions), I don't see how it is relevant to take the derivative (and thus minimize) with respect to $f(x_{i})$ (which is a real number). Furthermore, this derivative approach seemingly has nothing to do with the proof strategy of the representer theorem: in that proof, $H(k)$ is decomposed into the span of the $k(x_{i},\cdot)$ and its orthogonal complement, and it is then shown that the component of $f \in H(k)$ lying in the orthogonal complement vanishes when evaluated at the $x_{i}$'s, and thus does not contribute to the loss term of the minimization problem (this decomposition is written out below).
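
For reference, that decomposition can be written out as follows (the notation $f_{\perp}$ is mine):
\begin{equation}
f = \sum_{j=1}^{N}\beta_{j}\,k(x_{j},\cdot) + f_{\perp},
\qquad \langle f_{\perp},\, k(x_{i},\cdot)\rangle_{H(k)} = 0 \ \ \text{for all } i,
\end{equation}
so that, by the reproducing property, $f(x_{i}) = \langle f, k(x_{i},\cdot)\rangle_{H(k)} = \sum_{j=1}^{N}\beta_{j}\,k(x_{j},x_{i})$ does not depend on $f_{\perp}$, while $\Vert f\Vert_{H(k)}$ only grows if $f_{\perp}\neq 0$.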

PS: I am hoping to use this result to show that the solution to the SVM optimization problem depends only on the 'support vectors'.

Any help would be greatly appreciated.
 

  • #2

Thank you for reaching out for help with this problem from the book "Learning with Kernels." I am happy to assist you in understanding the proof and how it relates to the Representer theorem.

To begin, let's restate the problem. We are given a dataset $\{(x_{1},y_{1}),\dots,(x_{N},y_{N})\}$ and we want to solve the following optimization problem:

\begin{equation}
\operatorname*{argmin}_{f \in H(k)} \sum_{i=1}^{N}L(y_{i},f(x_{i})) + \Vert f \Vert_{H(k)}
\end{equation}

The Representer theorem tells us that the minimizer $f \in H(k)$ can be represented as a linear combination of the kernel $k(x_{i},\cdot)$ with coefficients $\alpha_{i}$:

\begin{equation}
f(\cdot) = \sum_{i=1}^{N}\alpha_{i}k(x_{i},\cdot)
\end{equation}

Now, the problem asks us to show that if, for some $i$, both one-sided derivatives of the loss $L(y_{i},f(x_{i}))$ with respect to $f(x_{i})$ vanish at the minimizer, then the corresponding coefficient $\alpha_{i}$ vanishes. In other words, data points at which the loss is locally flat around the fitted value do not appear in the kernel expansion, and this is exactly what produces sparsity.

To understand why this is a sufficient requirement, we need to go back to the proof strategy of the Representer theorem. The key idea is that we can decompose $H(k)$ into two orthogonal subspaces: the span of $k(x_{i},\cdot)$ and its orthogonal complement. This means that any function $f \in H(k)$ can be written as a sum of two components: one that lies in the span of $k(x_{i},\cdot)$ and one that lies in the orthogonal complement.

Now, when we evaluate the function $f$ at the points $x_{i}$, the component lying in the orthogonal complement contributes nothing, because $f(x_{i}) = \langle f, k(x_{i},\cdot)\rangle_{H(k)}$ and that component is orthogonal to every $k(x_{i},\cdot)$. This means that the orthogonal component leaves the loss terms unchanged and can only increase the norm term, so the minimizer lies entirely in the span of the $k(x_{i},\cdot)$ — that is the representer theorem. The sparsity statement follows by asking what optimality requires of each individual coefficient $\alpha_{i}$; a sketch of that computation is given below, under the simplifying assumptions that the regularizer is the squared norm $\lambda\Vert f\Vert_{H(k)}^{2}$ and that each $L(y_{j},\cdot)$ is differentiable at $f(x_{j})$ (which holds at the index $i$ in question whenever both one-sided derivatives vanish, the derivative then being zero).
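
Here is a minimal sketch of that computation (my notation; the regularization parameter $\lambda > 0$ and the squared norm are assumptions I am adding, since the book's statement allows a general strictly monotonic regularizer). Using that the derivative of the evaluation functional $f \mapsto f(x_{j}) = \langle f, k(x_{j},\cdot)\rangle_{H(k)}$ is $k(x_{j},\cdot)$, setting the derivative of the regularized risk to zero gives
\begin{equation}
\sum_{j=1}^{N} L'\big(y_{j}, f(x_{j})\big)\, k(x_{j},\cdot) + 2\lambda f = 0
\qquad\Longrightarrow\qquad
f = -\frac{1}{2\lambda}\sum_{j=1}^{N} L'\big(y_{j}, f(x_{j})\big)\, k(x_{j},\cdot),
\end{equation}
where $L'(y_{j}, f(x_{j}))$ is the derivative of $t \mapsto L(y_{j}, t)$ at $t = f(x_{j})$. Comparing with the expansion $f = \sum_{j}\alpha_{j}k(x_{j},\cdot)$ (assuming the $k(x_{j},\cdot)$ are linearly independent), we can read off $\alpha_{j} = -\tfrac{1}{2\lambda}L'(y_{j}, f(x_{j}))$. In particular, if both one-sided derivatives of $L(y_{i},\cdot)$ vanish at $f(x_{i})$, then $L'(y_{i}, f(x_{i})) = 0$ and hence $\alpha_{i} = 0$. At indices where the loss is not differentiable, the same comparison goes through with subderivatives, which does not affect this conclusion.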
 

1. What is sparsity in the context of support vector machines (SVMs)?

Sparsity refers to the property of having a small number of non-zero coefficients in the SVM model. In other words, only a few data points are used as support vectors to define the decision boundary, while the rest of the data points have no impact on the model.
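
As an illustration (my addition, using scikit-learn), the sketch below fits a kernel SVM to synthetic data, checks that only a fraction of the training points become support vectors, and verifies that the decision function is a kernel expansion over those points alone:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Two-class synthetic data
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Soft-margin SVM with an RBF kernel
gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# Sparsity: only a subset of the training points carry non-zero coefficients
print(f"support vectors: {clf.support_vectors_.shape[0]} out of {X.shape[0]} points")

# The decision function is a kernel expansion over the support vectors only:
#   f(x) = sum_i alpha_i k(x_i, x) + b, with alpha_i = 0 for non-support points.
K = rbf_kernel(X, clf.support_vectors_, gamma=gamma)     # k(x, x_i) for the support vectors x_i
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_   # rebuild f(x) from the expansion
assert np.allclose(f_manual, clf.decision_function(X))   # matches sklearn's decision function
```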

2. How does sparsity affect the performance of SVMs?

Sparsity can improve the performance of SVMs in terms of computational efficiency and generalization. By using only a small number of support vectors, the model becomes less complex and less prone to overfitting, which can lead to better generalization on unseen data. Additionally, sparsity allows for faster training and prediction times.

3. What is the relationship between sparsity and the Reproducing Kernel Hilbert Space (RKHS)?

SVMs use the RKHS framework to find the optimal decision function. By the representer theorem, this function is a kernel expansion over the training points, and the support vectors are exactly the points with non-zero coefficients, typically those lying on the margin or violating it. Sparsity in SVMs therefore means that only a small number of kernel terms are needed to represent the decision boundary.

4. How is sparsity achieved in SVMs?

Sparsity comes from the loss function rather than from the kernel trick. The hinge loss (and, in regression, the $\varepsilon$-insensitive loss) is identically zero, and in particular flat, for points that are fitted well enough; by the result discussed in this thread (equivalently, by the KKT conditions of the SVM dual problem), such points receive zero coefficients and drop out of the kernel expansion, leaving a sparse model. A short calculation with the hinge loss is given below.
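
As a standard illustration (my addition), take the hinge loss used in soft-margin SVM classification:
\begin{equation}
L(y, f(x)) = \max\{0,\ 1 - y\,f(x)\}.
\end{equation}
Any training point with $y_{i}f(x_{i}) > 1$, i.e. correctly classified with room to spare, has $L(y_{i},\cdot)$ identically zero in a neighbourhood of $f(x_{i})$, so both one-sided derivatives vanish and, by the result discussed in this thread, $\alpha_{i} = 0$. Only points with $y_{i}f(x_{i}) \le 1$ can end up as support vectors.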

5. Are there any drawbacks to sparsity in SVMs?

One potential drawback of sparsity in SVMs is the sensitivity to outliers. Since only a few data points are used as support vectors, outliers can have a significant impact on the decision boundary and potentially lead to poor performance. Additionally, the selection of the kernel function can also affect the sparsity of an SVM model.
