How do we represent multidimensional data vectors on a 2D plot?

dexterdev · Aug 8, 2013

Hi all,
I would like to know whether there is a method for representing multidimensional vectors on a 2D plot by extracting unique features of those vectors. Can eigenvalues be used for this kind of dimensionality reduction? If so, how? I would like to learn more about these topics, so any links or book suggestions would be helpful.

-Devanand

chiro · Aug 9, 2013

Hey dexterdev.

The key question here is: what kind of features are you looking for?

Eigenvalues give information about the spectrum of an operator and, together with the eigenvectors, give an idea of the basis and the principal directions along which vectors are scaled.

Do you want, for example, to see which eigenvector corresponds to a really small eigenvalue and remove that principal basis vector to make a good enough "approximation" and reduce the dimension?

You could sort the eigenvectors by eigenvalue (in decreasing order) to do this kind of dimension reduction. The magnitude of each eigenvalue tells you how much the corresponding eigenvector contributes.
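As a rough illustration of the sorting idea, here is a minimal sketch in Python with NumPy. It assumes the eigenvalues in question are those of the sample covariance matrix of the data (the usual PCA setup, which the post above does not state explicitly). The function name `pca_project_2d` and the random test data are made up for the example.

```python
import numpy as np

def pca_project_2d(X):
    """Project the rows of X (n_samples x n_features) onto the two
    eigenvectors of the covariance matrix with the largest eigenvalues."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # centre each feature
    cov = np.cov(Xc, rowvar=False)          # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # indices for decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    coords = Xc @ eigvecs[:, :2]            # 2D coordinates of every sample
    return coords, eigvals

# Hypothetical 5-dimensional data with very different spreads per feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
coords, eigvals = pca_project_2d(X)
```

The two columns of `coords` can be passed to any scatter-plot routine as the x and y values. The sorted `eigvals` show how much variance each direction carries, so small trailing values indicate directions that can be dropped with little loss.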

dexterdev · Aug 10, 2013

Sir,
Basically, I am interested in learning machine learning, but I am still a beginner. I am trying to get some insight into eigenvalues, eigenvectors and so on. (The spectrum you mentioned here is entirely different from the frequency spectrum, right, Sir?) When you say "operator", I don't understand what you mean.
I also believe that principal components are orthogonal. Is that right?
Can you suggest some books or links related to this?

chiro · Aug 10, 2013

There are links between frequency spectrum and spectrum when you are talking about say solutions to linear ordinary differential equations. It really depends on what you are looking at and what you are using the results for.

Principal components are indeed orthogonal. What PCA actually does is take a set of random vectors (the observations are usually assumed to be independent) and "orthogonalize" the variables by un-correlating them. It does this by "rotating" the data, much in the same way you rotate a vector by multiplying it by a valid rotation matrix (an orthogonal matrix with determinant 1).
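The orthogonality and un-correlating claims can be checked numerically. The sketch below is only an illustration on made-up correlated data (the mixing matrix `A` is hypothetical). Note that a numerical eigen-solver returns an orthogonal matrix whose determinant may be +1 or -1, so the transformation is a rotation possibly combined with a reflection.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated 3-D data: independent columns mixed by a fixed matrix.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.2]])
X = rng.normal(size=(500, 3)) @ A.T
Xc = X - X.mean(axis=0)

cov = np.cov(Xc, rowvar=False)
eigvals, V = np.linalg.eigh(cov)      # columns of V are the principal directions

# The principal directions are orthonormal: V^T V = I.
orthonormal = np.allclose(V.T @ V, np.eye(3))
det_V = np.linalg.det(V)              # +1 (rotation) or -1 (rotation + reflection)

# After the change of basis, the new variables are uncorrelated.
scores = Xc @ V
cov_scores = np.cov(scores, rowvar=False)
off_diag = cov_scores - np.diag(np.diag(cov_scores))
uncorrelated = np.allclose(off_diag, 0.0, atol=1e-10)
```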

These components are sorted by their contribution to the variation (think variance) in decreasing order, so the first few vectors make large contributions to the variation of the data set and the last ones make the smallest.

This is why, in PCA, throwing out the last few vectors is a way of working in a lower dimension. Checking the eigenvalues is also a way to detect linear dependence or "close" linear dependence (eigenvalues at or near zero).
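A minimal sketch of both uses, on made-up data in which the fourth column is almost a linear combination of the first two. The 95% variance target and the 1e-4 ratio threshold are arbitrary choices for the example, not recommended values.

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(size=(300, 3))
# Fourth column is nearly x0 + 2*x1 (assumed example of "close" dependence).
dep = base[:, 0] + 2.0 * base[:, 1] + 1e-3 * rng.normal(size=300)
X = np.column_stack([base, dep])

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # decreasing order

explained = eigvals / eigvals.sum()   # fraction of total variance per component
cumulative = np.cumsum(explained)

# Smallest number of components that reaches the (arbitrary) 95% target.
k = int(np.searchsorted(cumulative, 0.95) + 1)

# A tiny smallest eigenvalue relative to the largest signals near dependence.
near_dependent = eigvals[-1] / eigvals[0] < 1e-4
```

Here the last eigenvalue is close to zero because the four columns effectively span only three dimensions, so discarding that component loses almost nothing.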

The area that you want to check out is Data Mining for more information on these topics. In terms of recommendations, you need to specify the approach you want to take.

Some books are purely practical in the sense that they give you a tool (like a library, an independent piece of software, etc.) and tell you how to use it through typed commands or a GUI.

Other books are more theoretical (data mining by its nature is practical) and they cover results and proofs in a way that gives understanding. They are often targeted at researchers, academics, or professionals who have an interest in the theory and its extensions.

If you want the pure theory, I'd recommend you look at the mathematics or statistics books. PCA is part of multivariate statistics, and eigenvalues and the various matrix decompositions are part of linear and multilinear algebra.

dexterdev · Aug 11, 2013

Sir,
Basically, what I am looking for is a practical approach, but I would also like to learn the theory behind it. I try to read IEEE papers and so on, but when I see big equations I get lost. My basic problem is that I lack the fundamentals, and I take a long time to understand concepts. I have a master's degree in electronics engineering (from India), but machine learning is a new area for me. I am also not an expert in my field. Only recently did I discover that signal processing and statistics are subjects dear to me. I dream of doing a PhD in areas like ML and signal processing.

chiro · Aug 11, 2013

From the statistical point of view, you should understand regression modelling and multivariate statistics.

If you have access to a university library, go to the statistics section and get books on those topics. You can use Amazon or something similar to get feedback and ratings on the books, but most books should cover the same sorts of things.

There is a dedicated book on Principal Components:

https://www.amazon.com/dp/0387954422/?tag=pfamazon01-20

In terms of data mining, a practical book that covers the tools (not so much the theory) and makes use of freely available open source software is this:

https://www.amazon.com/dp/1441998896/?tag=pfamazon01-20

Note that there are a lot of books like this, but since R and Rattle are free and open source, you can download them and play around with them straight away, as opposed to something that costs money (and is expensive).

Also, if you plan to do a lot of analysis in the future that uses statistics, then learning R is a worthwhile investment, since there are packages that do almost everything you could want in statistical analysis.

If you don't have a good statistics background to start with, then I'd suggest you get one in some form.

There are a ton of introductory statistics books, including this one:

https://www.amazon.com/dp/0321795431/?tag=pfamazon01-20

On top of statistics you will probably want to look at things like neural networks, decision trees, and various classification schemes like spatial classification and support vector machines.

Spatial classification looks at dividing space into disjoint regions. The regions can be parametric (spheres, ellipsoids, cuboids, etc., specified using parameters) or non-parametric (k-DOPs, convex hulls, etc., defined using general planes). For this you will need to understand geometry and linear algebra.
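To make the two kinds of region concrete, here is a small sketch of membership tests: one for a sphere given by a centre and radius, and one for a convex region given as an intersection of half-spaces (bounding planes). The function names and the sample points are made up for the example, and a real classifier would of course have to learn the regions from data.

```python
import numpy as np

def in_sphere(points, centre, radius):
    """Parametric region: inside if the distance to the centre is <= radius."""
    return np.linalg.norm(points - centre, axis=1) <= radius

def in_convex_region(points, normals, offsets):
    """Plane-defined region: inside if normal . p <= offset for every plane."""
    return np.all(points @ normals.T <= offsets, axis=1)

points = np.array([[0.0, 0.0], [0.9, 0.9], [2.0, 0.0]])

# Unit circle centred at the origin.
sphere_mask = in_sphere(points, centre=np.array([0.0, 0.0]), radius=1.0)

# Square |x| <= 1, |y| <= 1 written as four half-planes.
normals = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
offsets = np.array([1.0, 1.0, 1.0, 1.0])
region_mask = in_convex_region(points, normals, offsets)
```

The point (0.9, 0.9) falls outside the circle but inside the square, which shows how the choice of region shape changes the classification of the same point.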

Also if you read basic research papers, you will need to know what integrals and derivatives are and what they mean in the context of your problem.

Also be aware that different researchers use different platforms. R is a multi-platform environment (Linux, Windows, Mac), but some source code might be written for Linux or depend on packages that are Linux-only. So if you have to use Linux, Windows, or a Mac exclusively, be aware of how to do so.

dexterdev · Aug 11, 2013

Thank you for the guidance and suggestions.
What about MATLAB / Octave as a tool in this area? I have prior experience with MATLAB.

chiro · Aug 11, 2013

There should be libraries out there for MATLAB (and possibly Octave) and they will range from open source libraries to commercial ones depending on what you need them for, how good they are performance wise, and what kind of functionality they have.

Just note, though, that MATLAB's core data structure is the matrix: it's not the only thing it handles, but it is the core structure and the tool is designed with this in mind.

If you are going to use statistical calculations and similar functionality, then R as a tool is good for this use.

Different tools have different advantages and disadvantages: MATLAB is a computational tool primarily for matrix problems and numerical calculations. R is primarily for statistical type calculations and output (including graphs).

In your work you will have to use multiple tools for the job and you should get used to this idea because using the output of one tool for the input of another is a common thing when the project is broad in its scale.

Be aware of what a tool's tradeoffs are and what it should be used for (as well as what it should not).

dexterdev · Aug 11, 2013

thanks again sir...
