Tuesday, April 14, 2009

Neat blog on Sankey diagrams.

I just noticed a very nice blog dedicated to Sankey diagrams. As Phineas mentions in one post: Sankey diagrams are directed graphs.

notes on "Classes of complex networks defined by role-to-role connectivity profiles"

Looking for algorithms to do community detection (there are a bunch from the famous Girvan-Newman algorithm doi:10.1073/pnas.122653799 which seems to have gained popularity through being very simple, effiencet and usually pretty powerfull, through to a method doi:10.1103/PhysRevE.71.046117 by Ziv, Middendorf and Wiggins. I have to admit I don't understand that paper yet, but hope to come back to it.) I was pointed to a paper by Guimera, Sales-Pardo and Amaral. In looking for the paper the first one that I found was Classes of complex networks defined by role-to-role connectivity profiles on DOI:10.1038/nphys489. It turns out that they describe their algorithm in the doi:10.1038/nature03288, but as I encountered the nautre physics aritcle I may as well read my way through that and give you the skinny on it.

The main story in this paper is that real networks are complicated, and that the properties of modules in the network are not reflected by the average properites of the entire network. The implication would seem to be that models for generating netowrks do note capture importnat properties of real world netowrks. That's not really a great surprise, however I guess that it is nice that someone has sone an explicit comparison.

One interesting idea in the paper is that they try to classify modules according to how the average connectivity of the connecting nodes in the modules (the nodes that bridge out to other modules) is distributed among the other modules. The classes they describe are non-hubs, with non-hubs being further described into ultra-peripheral nodes, peripheral nodes, satellite connectors and kinless nodes. Hubs get broken out into connector hubs, global hubs and provincial hubs.

These descriptions are tightly defined in terms of properties of the identified modules. The claim in the paper is that global properties of real networks, specifically some degree-degreee correlations, can only be reconstructed by taking into account this modular structure. The new property is called the "Participation Function".

There is another imporant aspect to this picture. If we think of the modules as building blocks, this paper is saying that the kinds of building block of the network are key to understanding the properties of the network. In addition, how those building blocks tend to connect to each other is an equally important aspect of this picture and can be measured by looking at the likelihood of a connection between two blocks in a given network in contrast to a random connection probablility.

An illuminating example is given by describing the differences between Cincinnati airport and Johannesburg airport. The two have the same degree distribution, but you can fly to the latter from most major airports in the world and not the former. This is described by noting that Joberg is the most connected city in it's region, but the same does not hold true for Cincinnati (trivia, this city was named after a roman dictator).

This is mainly a descriptive paper, but the authors suggest that the differences may arise in real world networks depending on whether they are transportation networks that are constrained by conservation laws, as opposed to signialling networks that face no such constrainsts. (I'm not sure, without thinking about it further, that you wouldn't be able to define some kind of a conservation law on a signalling network, given that there will tend to be some constrainst, e.g. attention time/information processing capacity of the reciever nodes).

It seems from the paper that they use simulated annealing to determing modularity.




The math bit

The modularity of a partition $\mathcal{M} \left( \mathcal{P} \right)$ is defined as

\mathcal{M} \left( \mathcal{P} \right) = \sum_{s=1}^{N_{M}} \left[ \frac{l_{s}}{L} - \left( \frac{d_{s}}{2L} \right)^{2} \right]


Where $L$ is the number of links in the network, $N_{M}$ is the number of non-empty modules in the partition, $l_{s}$ is the number of links in a module $s$ and $d_{s}$ is the sum of degrees in a module $s$.

Pick your favorite algorithm (I don't have one yet!) to find a partition to your liking.

Modules roles are determined by looking at where the modules fall in the $z$ - $P$ plane where $z$ is the relative within module degree and $P$ is the participation coefficient. I'll have a look at these two metrics in a later post.

Monday, April 13, 2009

Why I'm setting up this blog

Right, well, I'm pretty interested in graph theory. I work in publishing for the journal Nature on new web products, and I need a focussed way of getting my thoughts together on a range of topics that touch on science. I wanted a space that was more focussed than my personal blog.

I have a few specific aims for this blog.

  • Get some of my thoughts on publishing, science and the philosophy of science out of my head and into an archived public space

  • Make notes on any academic papers that I read that touch on my area of interests

  • Motivate me to take all the cruddy code that I have lying around and push it public

  • I'd like to use this blog as a driver to learning how to visualize and extract sense out of large graphs


  • I'm hoping to perhaps get to grips to Processing as a means for the last of these three.

    In regards to the first of these three, I do intend to draw a dotted line between my personal opinions and bits of science that I find interesting, and projects that I am involved in at work. I like the tao of mac disclaimer, and I hope to tend more towards following this advice than not, however it is not clear to me that I would wish to refrain from commenting on aspects of the work that I do, especially those that reside very much in the public domain.

    That said, all opinions voiced here are strictly my own.

    Directed Graph blog setup.

    I've decided to use syntaxhiligher2 to do syntax hi-lighting and a call to Yourequations.com for rendering latex on the blog. Bot Cyborg has a good description of how to get this to work on blogger.

    The main advantage of this approach is that the pseudo code is written in the blog posts in plain encased in a pre statement, and the equations are also just vanilla latex in a pre statement. So if we wanted to think about the betweenness centrality of a graph we have the equation:


    C_B(v)= \sum_{s \neq v \neq t \in V \atop s \neq t}\frac{\sigma_{st}(v)}{\sigma_{st}}


    and in python for a given graph object we could use the excellent networkx package to calculate it like so


    import networkx as nx #import networkx
    G=nx.read_adjlist("graph.adjlist") # read in a graph graph
    nodes_betweenness_centrality = nx.betweenness_centrality(G) # returns a dictionary
    for node_betweenness_centrality in nodes_betweenness_centrality:
    print node_betweenness_centrality