Friday, February 19, 2010

Python PEP for a graph API

I just stumbled across this
http://wiki.python.org/moin/PythonGraphApi. I think it's great. The
discussion has actually been going on since Aug 2004, so I don't know
what the status of this PEP is. I would love to see something come out
of it eventually.

tags: graph, PEP, python

Wednesday, February 10, 2010

Probabilistic language models, auto-correction tools and scientific discovery.

"Durgesh Kumar Dwivedi":http://network.nature.com/people/U56CB3E51/profile over on Nature Network just asked "Does anyone have any software or web address which corrects English grammar, preposition, edit and shortened the paragraphs?". This question brought to mind an idea that I had a few years ago.

The idea is simple enough: use a large corpus of pre-vetted, grammatically correct text as training data to compare sentences against. If you have enough example sentences, then every word in a given sentence can be assigned a likelihood of occurring in that position. Errors and new word formulations will have low probabilities of occurring. Compare a manuscript that is being prepared for submission against the corpus and the machine should be able to point out the parts that may be either wrong or novel. Some kind of a Bayesian model would seem to be appropriate.
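
A minimal sketch of how this might look, assuming a plain bigram model with add-one smoothing rather than a full Bayesian model. The function names, the toy corpus and the threshold of 0.1 are all made up for illustration; a real corpus would need a far larger sample and a tuned threshold.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_bigrams(sentences):
    """Count single words and adjacent word pairs over vetted sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # "<s>" is a hypothetical start-of-sentence marker.
        tokens = ["<s>"] + tokenize(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one smoothing over the seen vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def flag_unlikely(sentence, unigrams, bigrams, threshold=0.1):
    """Return (previous word, word, probability) for each low-probability pair."""
    tokens = ["<s>"] + tokenize(sentence)
    flagged = []
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_probability(prev, word, unigrams, bigrams)
        if p < threshold:  # assumed cut-off, chosen for the toy corpus only
            flagged.append((prev, word, p))
    return flagged


# Toy stand-in for the vetted control corpus.
corpus = [
    "the results are shown in the table",
    "the results are shown in the figure",
    "the data are shown in the table",
]
unigrams, bigrams = train_bigrams(corpus)

for prev, word, p in flag_unlikely("the results is shown in the table",
                                   unigrams, bigrams):
    print(f"unusual: '{prev} {word}' (p = {p:.3f})")
```

A flagged pair only says that the wording is rare in the corpus, so it may be an error or it may be something genuinely new.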

Now for natural language in general it is probably the case that there are not enough overlaps of complete sentences (though there may well be of phrases). However, if you look at the academic literature then the scope of language used is very much reduced. The scientific literature in particular adopts an inbred subset of the English language, its very own ghetto. One could imagine, for instance, taking all of the content of all articles published by Nature over the past 30 years and using this as the control corpus. The person submitting a manuscript would get, on return of submission, a markup of where in their text there may be errors, along with, perhaps, the most common forms of sentences that are found in their place.

I don't imagine that such a service would come into existence any time soon, but I think it would be cool. One could also use something like this to automatically recommend references or related papers. The "Journal Author Name Estimator":http://www.biosemantics.org/jane/ already does something like this for abstracts.

There is a wealth of research on probabilistic language models (see below), but I don't think anyone has tried out the idea proposed here.

It came to me after a few years working in a copy editing department of a scientific publisher. Again and again we would see the same kinds of corrections happening, and it just seems like an area ripe for automation.

"Using a probabilistic translation model for cross-language information retrieval":http://eprints.kfupm.edu.sa/74398/

"Language Analysis and Understanding":http://cslu.cse.ogi.edu/HLTsurvey/ch3node2.html

"A Parallel Training Algorithm for Hierarchical Pitman-Yor Process Language Models":http://www.cstr.ed.ac.uk/downloads/publications/2009/sh_interspeech09.pdf

A Bayesian network coding scheme for annotating biomedical information presented to genetic counseling clients "doi:10.1016/j.jbi.2004.10.001":http://dx.doi.org/10.1016/j.jbi.2004.10.001

"Phrase-Based Statistical Language Modeling from Bilingual Parallel Corpus":http://www.springerlink.com/content/b4ujx41571p47082/

"Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors":http://videolectures.net/icml08_wallach_bmd/

"Using language models for tracking events of interest over time":http://boston.lti.cs.cmu.edu/callan/Workshops/lmir01/WorkshopProcs/Papers/mspitters.pdf

Thursday, February 4, 2010

Cost benefit of publishing academic books

Nature has just published an editorial (http://www.nature.com/nature/journal/v463/n7281/full/463588a.html) promoting the idea of writing textbooks (http://dx.doi.org/10.1038/463588a).
Having worked for a number of years as a commissioning editor for a
major academic publisher, responsible for more than 20% of academic
book output, I have to say that the cost-benefit analysis for
publishing books ends up saying that it just does not justify the
effort for academics to write monographs. You probably won't see more
than a few hundred sales, and citations to the work will be slow in
coming. Early career academics in particular are wasting valuable time
that should be spent on getting publications out. That said, there are
three situations in which it might be OK to be involved in producing a
book:

1. You are at an advanced stage in your career and you want to codify
your vision of a particular subject. In such a case the work is a
labour of love: you have your laurels and now you want to produce an
artefact that synthesises your view on a topic. This is a highly
valuable exercise; look at the works of Chandrasekhar for an extreme
example of this. Of course, such an individual is going to go ahead
and do this anyway.

2. You have been instructing a class and have put together a detailed
set of instructional notes, especially for advanced classes in
graduate school. For a little more effort you can convert a large
batch of work that you have already done into another artefact that
can increase your academic reputation. Go for it!

3. You are involved in a large consortium or working group. The act of
putting together a chapter for a book can cement working
relationships. What tends to be more important here, though, is the
collaborative process, more so than the final artefact. The question
to be asked should be whether working with the given group of
academics is worth the time, rather than whether the final book will
be worth the time involved.