### CWI report R 0010

Rights: IPR: CWI
DB ID: 4463
Digital object:http://oai.cwi.nl/oai/asset/4463/04463D.pdf (published version)
Persistent Identifier:urn:NBN:nl:ui:18-4463 (resolver: http://persistent-identifier.org/ - leads to CWI repository; use
http://persistent-identifier.org/?identifier=urn:nbn:nl:ui:18-4463 to refer to this repository item)
Open Access:A publicly accessible file is available in the CWI repository
Type: BibTeX type: techreport (report type: cwireport)
OAI type: preprint
Authors: Dongen, S. van (1)
Title: A cluster algorithm for graphs
Abstract: A cluster algorithm for graphs called the emph{Markov Cluster algorithm (MCL~algorithm) is introduced. The algorithm provides basically an interface to an algebraic process defined on stochastic matrices, called the MCL~process. The graphs may be both weighted (with nonnegative weight) and directed. Let~\$G\$~be such a graph. The MCL~algorithm simulates flow in \$G\$ by first identifying \$G\$ in a canonical way with a Markov graph \$G_1\$. Flow is then alternatingly expanded and contracted, leading to a row of Markov Graphs \$G_{(i)\$. Flow expansion corresponds with taking the~\$k^{th\$~power of a stochastic matrix, where~\$kinN\$. Flow contraction corresponds with a parametrized operator~\$Gamma_r\$, \$rgeq 0\$, which maps the set of (column) stochastic matrices onto itself. The image~\$Gamma_r M\$ is obtained by raising each entry in~\$M\$ to the~\$r^{th\$~power and rescaling each column to have sum~\$1\$ again. The heuristic underlying this approach is the expectation that flow between dense regions which are sparsely connected will evaporate. The invariant limits of the process are easily derived and in practice the process converges very fast to such a limit, the structure of which has a generic interpretation as an overlapping clustering of the graph~\$G\$. Overlap is limited to cases where the input graph has a symmetric structure inducing it. The contraction and expansion parameters of the MCL~process influence the granularity of the output. The algorithm is space and time efficient and lends itself to drastic scaling. This report describes the MCL~algorithm and process, convergence towards equilibrium states, interpretation of the states as clusterings, and implementation and scalability. The algorithm is introduced by first considering several related proposals towards graph clustering, of both combinatorial and probabilistic nature. Revised version of the report~[1]. A more mathematically oriented account on the MCL~process is given in~[2], establishing that under certain weak conditions the iterands of the MCL~process posses structure admitting a cluster interpretation. Various experiments conducted on a wide range of test-graphs are described in~[3]. The latter report also describes a generic graph clustering performance measure and a distance defined on the space of partitions. The work was carried out under project INS-3.2, Concept Building from Key-Phrases in Scientific Documents and Bottom Up Classification Methods in Mathematics. [1] A new cluster algorithm for graphs. Technical report INS-R9814, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, 1998. [2] A stochastic uncoupling process for graphs. Technical report INS-R0011, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, 2000. [3] Performance criteria for graph clustering and Markov cluster experiments. Technical report INS-R0012, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, 2000.
Language: en
MSC classification codes: 05B20;15B48;15A51;62H30;68R10;68T10;90C35
MSC classification description: Matrices (incidence, Hadamard, etc.); Positive matrices and their generalizations; cones of matrices; Stochastic matrices; Classification and discrimination; cluster analysis; Graph theory; Pattern recognition, speech recognition; Programming involving graphs or networks
Publisher: CWI
Size: 40p.
ISSN: 1386-3681
Series: Information Systems [INS]
Acronym: INS
Number: R 0010
Year: 2000
Pages: 1 - 40
Pages number: 40p.
ORA creation date: 2008-09-04 14:
ORA modification date: 20090918134109
Mutation date: 2009-09-18
Darenet: x
Related postprint:
Other relations: