Algorithm for cluster finding (?)

  • Thread starter Thread starter Borek
  • Start date Start date
  • Tags Tags
    Algorithm
Click For Summary

Discussion Overview

The discussion revolves around identifying clusters within a graph of web pages based on their interlinking structure. Participants explore various theories and methods related to clustering, including connected components and social network theory, while seeking terminology and approaches for their analysis.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant suggests that the concept of Connected Components might be relevant for identifying clusters in the graph of web pages.
  • Another participant mentions using social network theory and proposes running a clustering algorithm on the links while examining frequency and rank fall-off.
  • A different approach is introduced involving the Laplacian matrix, where the number of connected components corresponds to the multiplicity of the eigenvalue 0.
  • Concerns are raised about the interpretation of connectedness, questioning whether being connected necessarily indicates clustering.
  • One participant proposes the idea of setting thresholds for connections to refine the clustering analysis.

Areas of Agreement / Disagreement

Participants express various methods and theories for clustering, but there is no consensus on a single approach or terminology. The discussion remains unresolved regarding the best method to identify clusters.

Contextual Notes

Participants mention limitations in their understanding of the problem and the need for clearer definitions and goals in their analysis. There is also uncertainty about the implications of connectedness in determining clusters.

Borek
Mentor
Messages
29,180
Reaction score
4,612
OK, I admit I have no idea what is the correct terminology, so could be thread title is completely off.

Imagine I have a graph of links between www pages. I suppose some of these pages are clustered - that is, they are heavily interlinked, but they don't link so extensively to other clusters/pages. How do I find them? This is not different from finding groups of friends in some population by analysing phone calls and so on.

I am more than sure that I have seen pictures showing results of such analysis (perhaps in Scientific American), but I can't remember any details, I am even not sure about correct terminology. Fact that I have probably read about the problem in Polish doesn't help either.

Names, links, keywords to look for will be appreciated.
 
Technology news on Phys.org
Borek said:
How do I find them?.
Connected Components may be the theory you're looking for. Another option is social network theory.

I had a similar problem, but my transitions could be read off a single file. I looked at transitions between clusters by building a table of first order markov probabilities, then displayed the graphs using networkx, which is a python toolkit for graph visualization, and it includes some connected components stuff. (I'll give you the code if you want.)

I suppose some of these pages are clustered - that is, they are heavily interlinked, but they don't link so extensively to other clusters/pages.
What about running a clustering algorithm on the links and looking at the frequency and rank fall off?
 
Last edited:
Take each page as a vertex in a giant graph where edges indicate links between sites. Then take the laplacian matrix. The number of connected components is the multiplicity of the eigenvalue 0. http://en.wikipedia.org/wiki/Laplacian_matrix
 
story645 said:
Connected Components may be the theory you're looking for. Another option is social network theory.

Social network page gave me some ideas to think about. Trick is, I am not sure what I want to do :wink: I have just some vague ideas that I want to test.

I had a similar problem, but my transitions could be read off a single file.

I can prepare such file. In fact I am going to try to do it and to feed it to some of the free programs listed on the wiki social network analysis software page.

oab729 said:
Take each page as a vertex in a giant graph where edges indicate links between sites. Then take the laplacian matrix. The number of connected components is the multiplicity of the eigenvalue 0.

I am not sure how it is going to help - just because pages are connected doesn't mean they are in the cluster, just because they are not connected doesn't mean they are not in the cluster.
 
Hmm... perhaps I misunderstand your question? John knows Alice. Alice Knows Alex. John Doesn't know Alice. The number of connected components is 1. Or do you mean John needs to really know Alice, so could you set thresholds or something? You could also look up clustering analysis or Partition decoupling method, if that's what you're thinking...
 

Similar threads

  • · Replies 6 ·
Replies
6
Views
2K
  • · Replies 8 ·
Replies
8
Views
4K
Replies
9
Views
3K
  • · Replies 46 ·
2
Replies
46
Views
7K
Replies
14
Views
3K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 3 ·
Replies
3
Views
2K
  • · Replies 13 ·
Replies
13
Views
5K
  • · Replies 30 ·
2
Replies
30
Views
7K
  • · Replies 11 ·
Replies
11
Views
4K