Directed Graph

Thursday, April 29, 2010

A mini rant about LaTeX


Here is a comment I've just posted to a comment thread about LaTeX over here: http://blogs.nature.com/farhat/2010/04/16/collaborative-editing-with-latex. One of the commenters asked "Why not just use a word processor?"


Just using a word processor simply does not cut it when it comes to expressing complex mathematical ideas. LaTeX, and its precursor TeX, have evolved to be the only tools that can currently express the richness of the ideas within advanced mathematics. Whether this is a good thing or not is moot; it's simply the case, so that's one reason to use LaTeX.


A huge advantage of using a text-only tool is that keeping different versions of a document, and diffing between versions, is much easier than the horror that is "Track Changes" in most editors. Many groups I know keep their papers as LaTeX files in a code repository and use a system such as SVN for collaboration. This is not a benefit of LaTeX specifically, and is way beyond what most people would do with it, but the need for this kind of ability with the objects upon which we collaborate is painfully clear, and is only starting to be addressed in the "rich document" space with services such as Google Wave and Google Docs.


Ideally one would like to be able to write one's equations directly into the computer. For the time being LaTeX provides an expressive syntax from the keyboard that translates into mathematics on the screen. The iPad may well change this, but I suspect that any solution that takes handwriting and converts it to marked-up mathematics will be built on top of the excellent codebase that is the LaTeX framework. This would be an excellent example of the DRY principle (don't repeat yourself): build on top of that which already works. For an example of a web app that does this already, have a look at:
http://detexify.kirelabs.org/classify.html.
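
To give a flavour of the keyboard syntax I mean, here is a toy, self-contained document of my own (nothing to do with the thread above); the plain keystrokes compile to properly typeset mathematics:

\documentclass{article}
\begin{document}
% The source is plain keystrokes; the output is typeset mathematics.
The Gaussian integral is
\[
  \int_{-\infty}^{\infty} e^{-x^{2}} \, dx = \sqrt{\pi} ,
\]
and a general Fourier coefficient is
$c_{n} = \frac{1}{2\pi} \int_{0}^{2\pi} f(x) e^{-inx} \, dx$.
\end{document}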


Richard is right that not separating the semantics from the presentation is a weakness. Sadly, one issue is that a single equation or symbol can actually mean different things. The closest tool there is to encapsulate the semantics of mathematics is the semantic version of MathML. No one writes this by hand. The best tools for producing semantic MathML (as opposed to the presentation version of MathML) do so by translating from human input in the form of, yes you guessed it, LaTeX.


In spite of being semantically dumb, there are a couple of interesting web services that build on top of the fact that there is a huge community of people out there who speak LaTeX. http://www.mathtran.org/formulas/ allows people to share formulas. http://www.latexsearch.com/ allows you to search through Springer's archive of mathematics. Springer has the most extensive archive of mathematical literature in the world, so that is no mean feat.


The last thing I will say is that LaTeX is just awesome. It produces beautiful documents. It is akin to learning a programming language, but then, if you are doing serious mathematics, it's just another symbol set; it's not that hard, really. I guess if you are doing something soft and not so well defined, then a word processor is going to be good enough for you.

Friday, February 19, 2010

Python PEP for a graph API

I just stumbled across this
http://wiki.python.org/moin/PythonGraphApi. I think it's great. The
discussion has actually been going on since Aug 2004, so I don't know
what the status of this PEP is. I would love to see something come out
of it eventually.

tags: graph, PEP, python

Wednesday, February 10, 2010

Probabilistic language models, auto-correction tools and scientific discovery.


"Durgesh Kumar Dwivedi":http://network.nature.com/people/U56CB3E51/profile over on Nature Network just asked "Does anyone have any software or web address which corrects English grammar, preposition, edit and shortened the paragraphs?". This question brought to mind and idea that I had a few years ago.

The idea is simple enough: use a large corpus of pre-vetted, grammatically correct text as a training set to compare sentences against. If you have enough example sentences, then every occurrence of every word in a given sentence will have a certain likelihood of occurring. Errors and new word formulations will have low probabilities of occurring. Compare a manuscript that is being prepared for submission against the corpus, and the machine should be able to point out the parts that may be either wrong or novel. Some kind of Bayesian model would seem to be appropriate.
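
As a rough sketch of the sort of machinery I have in mind (this is my own toy illustration, with a made-up two-sentence corpus, not a real implementation), a bigram model trained on the vetted corpus can score each manuscript sentence; the lowest-scoring sentences are the ones to flag:

from collections import Counter
import math

def bigrams(sentence):
    """Split a sentence into (previous word, word) pairs, with a start marker."""
    words = ["<s>"] + sentence.lower().split()
    return list(zip(words, words[1:]))

def train(corpus_sentences):
    """Count unigrams and bigrams over the vetted corpus."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus_sentences:
        for prev, word in bigrams(sentence):
            unigram_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return unigram_counts, bigram_counts

def log_prob(sentence, unigram_counts, bigram_counts, vocab_size):
    """Average per-word log probability, with add-one smoothing for unseen pairs."""
    pairs = bigrams(sentence)
    total = sum(
        math.log((bigram_counts[(prev, word)] + 1) /
                 (unigram_counts[prev] + vocab_size))
        for prev, word in pairs
    )
    return total / len(pairs)

# Score the sentences of a manuscript against the corpus model; low scores are suspect.
corpus = ["the cells were incubated overnight", "the samples were analysed in triplicate"]
manuscript = ["the cells was incubated overnight", "the samples were analysed in triplicate"]
uni, bi = train(corpus)
vocab = len({w for s in corpus for w in s.split()}) + 1   # +1 for the start marker
for sentence in manuscript:
    print(f"{log_prob(sentence, uni, bi, vocab):7.3f}  {sentence}")

With a real corpus the scoring would need to be more sophisticated than add-one smoothed bigrams, but the shape of the service is the same: train on vetted text, score the submission, highlight the outliers.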

Now for natural language it is probably the case that there are not enough overlaps of complete sentences (though there may well be of phrases). However, if you look at the academic literature then the scope of language used is very much reduced. The scientific literature in particular adopts an inbred subset of the English language, its very own ghetto. One could imagine, for instance, taking all of the content of all articles published by Nature over the past 30 years and using this as the control corpus. The person submitting a manuscript would get back, on submission, a markup of where in their text there may be errors, perhaps along with the most common forms of the sentences found in their place.

I don't imagine that such a service would come into existence any time soon, but I think it would be cool. One could also use something like this to automatically recommend references or related papers. The "Journal Author Name Estimator":http://www.biosemantics.org/jane/ already does something like this for abstracts.

There is a wealth of research on probabilistic language models (see below), but I don't think anyone has tried out the idea proposed here.

The idea came to me after a few years working in a copy editing department of a scientific publisher. Again and again we would see the same kinds of corrections happening, and it just seems like an area ripe for automation.

"Using a probabilistic translation model for cross-language information retrieval":http://eprints.kfupm.edu.sa/74398/

"Language Analysis and Understanding":http://cslu.cse.ogi.edu/HLTsurvey/ch3node2.html

"A Parallel Training Algorithm for Hierarchical Pitman-Yor Process Language Models":http://www.cstr.ed.ac.uk/downloads/publications/2009/sh_interspeech09.pdf

"A Bayesian network coding scheme for annotating biomedical information presented to genetic counseling clients":http://dx.doi.org/10.1016/j.jbi.2004.10.001

"Phrase-Based Statistical Language Modeling from Bilingual Parallel Corpus":http://www.springerlink.com/content/b4ujx41571p47082/

"Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors":http://videolectures.net/icml08_wallach_bmd/

"Using language models for tracking events of interest over time":http://boston.lti.cs.cmu.edu/callan/Workshops/lmir01/WorkshopProcs/Papers/mspitters.pdf

Thursday, February 4, 2010

Cost-benefit of publishing academic books

Nature has just published an editorial
(http://www.nature.com/nature/journal/v463/n7281/full/463588a.html)
promoting the idea of writing textbooks (http://dx.doi.org/10.1038/463588a).
Having worked for a number of years as a commissioning editor for a
major academic publisher, responsible for more than 20% of academic
book output, I have to say that the cost-benefit analysis for
publishing books ends up saying that it just does not justify the
effort for academics to write monographs. You probably won't see more
than a few hundred sales, and citations to the work will be slow in
coming. Early career academics in particular are wasting valuable time
that should be spent on getting publications out. That said, there are
three situations in which it might be OK to be involved in producing a
book:

1. You are at an advanced stage in your career and you want to codify
your vision of a particular subject. In such a case the work is a
labour of love, you have your laurels and now you want to produce an
artefact that synthesises your view on a topic. This is a highly
valuable exercise, look at the works of Chandrasekhar for an extreme
example of this. Of course, such an individual is going to go ahead
and do this anyway.

2. You have been instructing a class and have put together a detailed
set of instructional notes, especially for advanced classes in
graduate school. For a little more effort you can convert a large
batch of work that you have already done into another artefact that
can increase your academic reputation. Go for it!

3. You are involved in a large consortium or working group. The act of
putting together a chapter for a book can cement working
relationships. What tends to be more important here though is the
collaborative process, more so than the final artefact. The question
to be asked should be whether working with the given group of
academics is worth the time, rather than whether the final book will
be worth the time involved.

Tuesday, April 14, 2009

Neat blog on Sankey diagrams.

I just noticed a very nice blog dedicated to Sankey diagrams. As Phineas mentions in one post: Sankey diagrams are directed graphs.

notes on "Classes of complex networks defined by role-to-role connectivity profiles"

Looking for algorithms to do community detection (there are a bunch, from the famous Girvan-Newman algorithm doi:10.1073/pnas.122653799, which seems to have gained popularity through being very simple, efficient and usually pretty powerful, through to a method doi:10.1103/PhysRevE.71.046117 by Ziv, Middendorf and Wiggins. I have to admit I don't understand that paper yet, but hope to come back to it.) I was pointed to a paper by Guimera, Sales-Pardo and Amaral. In looking for the paper, the first one that I found was "Classes of complex networks defined by role-to-role connectivity profiles", doi:10.1038/nphys489. It turns out that they describe their algorithm in doi:10.1038/nature03288, but as I encountered the Nature Physics article first I may as well read my way through that and give you the skinny on it.

The main story in this paper is that real networks are complicated, and that the properties of modules in the network are not reflected by the average properties of the entire network. The implication would seem to be that models for generating networks do not capture important properties of real world networks. That's not really a great surprise, however I guess that it is nice that someone has done an explicit comparison.

One interesting idea in the paper is that they try to classify nodes according to how the average connectivity of the connecting nodes in the modules (the nodes that bridge out to other modules) is distributed among the other modules. The classes they describe are non-hubs and hubs, with non-hubs being further divided into ultra-peripheral nodes, peripheral nodes, satellite connectors and kinless nodes. Hubs get broken out into connector hubs, global hubs and provincial hubs.

These descriptions are tightly defined in terms of properties of the identified modules. The claim in the paper is that global properties of real networks, specifically some degree-degree correlations, can only be reconstructed by taking into account this modular structure. The new property is called the "Participation Function".

There is another important aspect to this picture. If we think of the modules as building blocks, this paper is saying that the kinds of building block of the network are key to understanding the properties of the network. In addition, how those building blocks tend to connect to each other is an equally important aspect of this picture, and can be measured by looking at the likelihood of a connection between two blocks in a given network in contrast to a random connection probability.

An illuminating example is given by describing the differences between Cincinnati airport and Johannesburg airport. The two have the same degree distribution, but you can fly to the latter from most major airports in the world and not the former. This is explained by noting that Johannesburg is the most connected city in its region, but the same does not hold true for Cincinnati (trivia: this city was named after a Roman dictator).

This is mainly a descriptive paper, but the authors suggest that the differences may arise in real world networks depending on whether they are transportation networks that are constrained by conservation laws, as opposed to signalling networks that face no such constraints. (I'm not sure, without thinking about it further, that you wouldn't be able to define some kind of a conservation law on a signalling network, given that there will tend to be some constraints, e.g. the attention time/information processing capacity of the receiver nodes.)

It seems from the paper that they use simulated annealing to determine modularity.




The math bit

The modularity of a partition $\mathcal{M} \left( \mathcal{P} \right)$ is defined as

$$\mathcal{M} \left( \mathcal{P} \right) = \sum_{s=1}^{N_{M}} \left[ \frac{l_{s}}{L} - \left( \frac{d_{s}}{2L} \right)^{2} \right]$$


where $L$ is the number of links in the network, $N_{M}$ is the number of non-empty modules in the partition, $l_{s}$ is the number of links within module $s$, and $d_{s}$ is the sum of the degrees of the nodes in module $s$.
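
To make the definition concrete, here is a toy Python sketch of my own that computes $\mathcal{M} \left( \mathcal{P} \right)$ directly from this formula (the little two-triangle graph and its partition are made up for illustration):

def modularity(edges, partition):
    """Compute M(P) = sum over modules of [l_s / L - (d_s / (2 L))^2].

    edges: list of undirected links as (node, node) pairs.
    partition: list of sets of nodes, one set per module.
    """
    L = len(edges)
    # Which module does each node belong to?
    module_of = {node: i for i, module in enumerate(partition) for node in module}
    links_within = [0] * len(partition)   # l_s: links with both ends in module s
    degree_sum = [0] * len(partition)     # d_s: sum of degrees of nodes in module s
    for u, v in edges:
        degree_sum[module_of[u]] += 1
        degree_sum[module_of[v]] += 1
        if module_of[u] == module_of[v]:
            links_within[module_of[u]] += 1
    return sum(l_s / L - (d_s / (2 * L)) ** 2
               for l_s, d_s in zip(links_within, degree_sum))

# Two triangles joined by a single bridge: the "obvious" partition scores well.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
print(modularity(edges, [{1, 2, 3}, {4, 5, 6}]))   # ~0.357
print(modularity(edges, [{1, 2, 3, 4, 5, 6}]))     # 0.0 for a single module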

Pick your favorite algorithm (I don't have one yet!) to find a partition to your liking.
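
If you just want a partition to experiment with, recent versions of the networkx library ship a greedy modularity-maximisation routine; note this is the Clauset-Newman-Moore agglomerative approach, not the simulated annealing used in the paper, so treat it purely as a stand-in:

import networkx as nx
from networkx.algorithms import community

# Ask networkx for a modularity-maximising partition of a small test graph.
G = nx.karate_club_graph()
partition = community.greedy_modularity_communities(G)
print(len(partition), "modules")
print(community.modularity(G, partition))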

Node roles are determined by looking at where the nodes fall in the $z$ - $P$ plane, where $z$ is the relative within-module degree and $P$ is the participation coefficient. I'll have a look at these two metrics in a later post.

Monday, April 13, 2009

Why I'm setting up this blog

Right, well, I'm pretty interested in graph theory. I work in publishing for the journal Nature on new web products, and I need a focussed way of getting my thoughts together on a range of topics that touch on science. I wanted a space that was more focussed than my personal blog.

I have a few specific aims for this blog.

  • Get some of my thoughts on publishing, science and the philosophy of science out of my head and into an archived public space

  • Make notes on any academic papers that I read that touch on my area of interests

  • Motivate me to take all the cruddy code that I have lying around and push it public

  • I'd like to use this blog as a driver to learning how to visualize and extract sense out of large graphs


  • I'm hoping to get to grips with Processing as a means to the last of these.

In regard to the first of these, I do intend to draw a dotted line between my personal opinions and bits of science that I find interesting, and projects that I am involved in at work. I like the tao of mac disclaimer, and I hope to tend more towards following this advice than not; however, it is not clear to me that I would wish to refrain from commenting on aspects of the work that I do, especially those that reside very much in the public domain.

That said, all opinions voiced here are strictly my own.