Thursday, April 29, 2010

A mini rant about LaTeX


Here is a comment I've just posted to a comment thread about LaTeX over here: http://blogs.nature.com/farhat/2010/04/16/collaborative-editing-with-latex. One of the commenters asked, "Why not just use a word processor?"


Just using a word processor simply does not cut it when it comes to expressing complex mathematical ideas. LaTeX, and its precursor TeX, have evolved to be the only tools that can currently express the richness of the ideas within advanced mathematics. Whether this is a good thing or not is moot; it's simply the case, so that's one reason to use LaTeX.


A huge advantage of using a text-only tool is that keeping different versions of a document, and diffing between versions, is much easier than the horror that is "Track Changes" in most editors. Many groups I know keep their papers as LaTeX files in a code repository and use a system such as SVN for collaboration. This is not a benefit of LaTeX specifically, and is way beyond what most people would do with it. However, the need for this kind of ability with the objects upon which we collaborate is painfully clear, and is only starting to be addressed in the "rich document" space with services such as Google Wave and Google Docs.
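To make the diffing point concrete, here is a minimal sketch using Python's standard `difflib` module. The file names and the two LaTeX fragments are made up for illustration; in practice a tool such as SVN would produce this kind of output for you.

```python
import difflib


def diff_versions(old_text, new_text, old_name="paper_v1.tex", new_name="paper_v2.tex"):
    """Return a unified diff between two versions of a plain-text document."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name)
    )


# Two hypothetical revisions of the same LaTeX fragment.
v1 = "\\section{Results}\nWe find that $E = mc^2$.\n"
v2 = "\\section{Results}\nWe find that $E = mc^2$ holds.\n"

print(diff_versions(v1, v2))
```

Because the source is plain text, the change shows up as one removed line and one added line, which is exactly what makes review and merging tractable.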


Ideally one would like to be able to write one's equations directly into the computer. For the time being, LaTeX provides an expressive syntax from the keyboard that translates into mathematics on the screen. The iPad may well change this, but I suspect that any solution that takes handwriting and converts it to marked-up mathematics will be built on top of the excellent codebase that is the LaTeX framework. This would be an excellent example of the DRY principle (don't repeat yourself): build on top of that which already works. For an example of a web app that does this already, have a look at:
http://detexify.kirelabs.org/classify.html.
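For anyone who has not seen the keyboard syntax mentioned above, here is a small illustration of my own choosing (it is not taken from the thread): the Gaussian integral, typed as plain text.

```latex
% The Gaussian integral, written entirely from the keyboard.
\[
  \int_{-\infty}^{\infty} e^{-x^{2}} \, dx = \sqrt{\pi}
\]
```

Everything the typesetter needs (limits, exponent, spacing, the square root) is spelled out in a handful of ASCII characters.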


Richard is right that not separating the semantics from the presentation is a weakness. Sadly, one issue is that a single equation or symbol can actually mean different things. The closest tool there is to encapsulate the semantics of mathematics is the semantic version of MathML. No one writes this by hand. The best tools for producing semantic MathML (as opposed to the presentation version of MathML) do so by translating from human input in the form of, yes you guessed it, LaTeX.


Although LaTeX is semantically dumb, there are a couple of interesting web services that build on top of the fact that there is a huge community of people out there who speak LaTeX. http://www.mathtran.org/formulas/ allows people to share formulas. http://www.latexsearch.com/ allows you to search through Springer's archive of mathematics. Springer has the most extensive archive of mathematical literature in the world, so that is no mean feat.


The last thing I will say is that LaTeX is just awesome. It produces beautiful documents. It is akin to learning a programming language, but then, if you are doing serious mathematics, it's just another symbol set, it's not that hard, really. I guess if you are doing something soft and not so well defined, then a word processor is going to be good enough for you.

Friday, February 19, 2010

Python PEP for a graph API

I just stumbled across this
http://wiki.python.org/moin/PythonGraphApi. I think it's great. The
discussion has actually been going on since Aug 2004, so I don't know
what the status of this PEP is. I would love to see something come out
of it eventually.

tags: graph, PEP, python

Wednesday, February 10, 2010

Probabilistic language models, auto-correction tools and scientific discovery.

"Durgesh Kumar Dwivedi":http://network.nature.com/people/U56CB3E51/profile over on Nature Network just asked "Does anyone have any software or web address which corrects English grammar, preposition, edit and shortened the paragraphs?". This question brought to mind an idea that I had a few years ago.

The idea is simple enough: use a large corpus of pre-vetted, grammatically correct text as a training set to compare sentences against. If you have enough example sentences, then every occurrence of every word in a given sentence will have a certain likelihood of occurring given the words around it. Errors and new word formulations will have low probabilities of occurring. Compare a manuscript that is being prepared for submission against the corpus, and the machine should be able to point out the parts that may be either wrong or novel. Some kind of Bayesian model would seem to be appropriate.
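A rough sketch of how the idea above might look in code, assuming the simplest possible model: a bigram model with add-alpha smoothing (which can be read as a very basic Bayesian prior over word pairs). The function names, the tiny corpus and the threshold are all made up for illustration; a real system would need a far larger corpus and a richer model.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_bigrams(corpus_sentences):
    """Count single words and adjacent word pairs in the vetted corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = ["<s>"] + tokenize(sentence)  # <s> marks the sentence start
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams, alpha=1.0):
    """Estimate P(word | prev) with add-alpha smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)


def flag_unlikely(sentence, unigrams, bigrams, threshold=0.15):
    """Return the word pairs whose estimated probability falls below threshold."""
    tokens = ["<s>"] + tokenize(sentence)
    flagged = []
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_probability(prev, word, unigrams, bigrams)
        if p < threshold:
            flagged.append((prev, word, p))
    return flagged


# A toy stand-in for the "pre-vetted" corpus.
corpus = [
    "the cell was stained",
    "the cell was imaged",
    "the sample was stained",
]
unigrams, bigrams = train_bigrams(corpus)

for prev, word, p in flag_unlikely("the cell were stained", unigrams, bigrams):
    print(f"unusual: '{prev} {word}' (p = {p:.3f})")
```

Note that a flagged pair only means "not seen often in the corpus", so it cannot by itself tell an error apart from a genuinely new formulation.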

Now, for natural language in general it is probably the case that there are not enough overlaps of complete sentences (though there may well be of phrases). However, if you look at the academic literature, then the scope of language used is very much reduced. The scientific literature in particular adopts an inbred subset of the English language, its very own ghetto. One could imagine, for instance, taking all of the content of all articles published by Nature over the past 30 years and using this as the control corpus. The person submitting a manuscript would get back, on submission, a markup showing where in their text there may be errors, perhaps together with the most common forms of sentence that are found in their place.

I don't imagine that such a service would come into existence any time soon, but I think it would be cool. One could also use something like this to automatically recommend references or related papers. The "Journal Author Name Estimator":http://www.biosemantics.org/jane/ already does something like this for abstracts.

There is a wealth of research on probabilistic language models (see below), but I don't think anyone has tried out the idea proposed here.

It came to me after a few years working in a copy editing department of a scientific publisher. Again and again we would see the same kinds of corrections happening, and it just seems like an area ripe for automation.

"Using a probabilistic translation model for cross-language information retrieval":http://eprints.kfupm.edu.sa/74398/

"Language Analysis and Understanding":http://cslu.cse.ogi.edu/HLTsurvey/ch3node2.html

"A Parallel Training Algorithm for Hierarchical Pitman-Yor Process Language Models":http://www.cstr.ed.ac.uk/downloads/publications/2009/sh_interspeech09.pdf

"A Bayesian network coding scheme for annotating biomedical information presented to genetic counseling clients":http://dx.doi.org/10.1016/j.jbi.2004.10.001

"Phrase-Based Statistical Language Modeling from Bilingual Parallel Corpus":http://www.springerlink.com/content/b4ujx41571p47082/

"Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors":http://videolectures.net/icml08_wallach_bmd/

"Using language models for tracking events of interest over time":http://boston.lti.cs.cmu.edu/callan/Workshops/lmir01/WorkshopProcs/Papers/mspitters.pdf

Thursday, February 4, 2010

Cost benefit of publishing academic books

Nature has just published an editorial (http://www.nature.com/nature/journal/v463/n7281/full/463588a.html) promoting the idea of writing textbooks (http://dx.doi.org/10.1038/463588a).
Having worked for a number of years as a commissioning editor for a
major academic publisher, responsible for more than 20% of academic
book output, I have to say that the cost-benefit analysis for
publishing books ends up saying that it just does not justify the
effort for academics to write monographs. You probably won't see more
than a few hundred sales, and citations to the work will be slow in
coming. Early career academics in particular are wasting valuable time
that should be spent on getting publications out. That said, there are
three situations in which it might be OK to be involved in producing a
book:

1. You are at an advanced stage in your career and you want to codify
your vision of a particular subject. In such a case the work is a
labour of love, you have your laurels and now you want to produce an
artefact that synthesises your view on a topic. This is a highly
valuable exercise; look at the works of Chandrasekhar for an extreme
example of this. Of course, such an individual is going to go ahead
and do this anyway.

2. You have been instructing a class and have put together a detailed
set of instructional notes, especially for advanced classes in
graduate school. For a little more effort you can convert a large
batch of work that you have already done into another artefact that
can increase your academic reputation. Go for it!

3. You are involved in a large consortium or working group. The act of
putting together a chapter for a book can cement working
relationships. What tends to be more important here though is the
collaborative process, more so than the final artefact. The question
to be asked should be whether working with the given group of
academics is worth the time, rather than whether the final book will
be worth the time involved.