Chemical Compounds Grammar

14 March, 2012

Following the MOLTO 4th Project Meeting in Zurich, I am making my findings so far on the Chemical Compounds grammar available, to facilitate a "handover" to the UHEL team.

I have put all my code fragments in the MOLTO SVN repository under the chemical-compounds folder.

Cooke-Fox et al.

A series of papers that look specifically at grammar-based translation of chemical names. The second paper says that the grammar the authors wrote should be attached to it, but however hard I looked I could not find a copy. I have transcribed part of the grammar from the paper into chemical-compounds/Cooke-Fox/grammar.cf, but we really need to get our hands on the original, full grammar.

Name=Struct

Another chemical name parser, which tries to be robust. It seems relevant, but I could not find any hard details about the system; I assume it is not open source.

AUTONOM

CHEMorph

A parser from IUPAC names into SMILES, a line notation that serves as a sort of canonical representation. For example, "7-hydroxyheptan-2-one" becomes OCCCCCC(=O)C.

The SMILES spec is available, for example, here: http://www.opensmiles.org/spec/open-smiles-2-grammar.html#2.2 . I wrote a small CFG based on the grammar at that URL; it is in the chemical-compounds/SMILES folder.
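To give a feel for what a grammar over SMILES looks like, here is a minimal sketch of a recursive-descent recogniser in Python. It is not the CFG from the repository, and it covers only a small subset that I am assuming for illustration: unbracketed atoms, the bonds -, = and #, and parenthesised branches. Ring closures, bracket atoms and aromatic atoms are left out. The function names are made up for this example.

```python
# Assumed toy subset of SMILES (not the full OpenSMILES grammar):
#   chain  ::= atom ( branch | bond? atom )*
#   branch ::= "(" bond? chain ")"
#   atom   ::= "Cl" | "Br" | "B" | "C" | "N" | "O" | "P" | "S" | "F" | "I"
#   bond   ::= "-" | "=" | "#"

ATOMS = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")  # two-letter symbols first
BONDS = "-=#"


def parse_atom(s, i):
    """Consume one atom symbol at index i; return (1, next_index)."""
    for sym in ATOMS:
        if s.startswith(sym, i):
            return 1, i + len(sym)
    raise ValueError(f"expected atom at position {i}")


def parse_branch(s, i):
    """Consume "(" bond? chain ")" at index i; return (atom_count, next_index)."""
    i += 1  # skip "("
    if i < len(s) and s[i] in BONDS:
        i += 1
    count, i = parse_chain(s, i)
    if i >= len(s) or s[i] != ")":
        raise ValueError(f"expected ')' at position {i}")
    return count, i + 1


def parse_chain(s, i=0):
    """Consume a chain starting at index i; return (atom_count, next_index)."""
    count, i = parse_atom(s, i)
    while i < len(s):
        if s[i] == "(":
            inner, i = parse_branch(s, i)
            count += inner
        elif s[i] in BONDS or s.startswith(ATOMS, i):
            if s[i] in BONDS:
                i += 1  # an explicit bond must be followed by an atom
            n, i = parse_atom(s, i)
            count += n
        else:
            break  # e.g. ")" closing an enclosing branch
    return count, i


def count_atoms(smiles):
    """Return the number of atoms written in the string, or raise ValueError."""
    count, end = parse_chain(smiles)
    if end != len(smiles):
        raise ValueError(f"unexpected character at position {end}")
    return count


# SMILES for 7-hydroxyheptan-2-one: seven carbons and two oxygens.
print(count_atoms("OCCCCCC(=O)C"))
```

The recogniser only checks the shape of the string and counts atoms; it does no valence checking and builds no structure.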

Lexichem

This project translates compound names using string replacement rather than grammars. It is very relevant, but it is NOT open source, although according to the website it may be possible to get an academic license to use it for free.

OPSIN

This is a full parser for IUPAC nomenclature which is open source. There is a demo here: http://opsin.ch.cam.ac.uk/

The system is written in Java, and the parser itself is essentially built out of a massive regular expression, which is split into hundreds of pieces across various XML files (see chemical-compounds/OPSIN XML). I have made a small attempt at converting these regular expressions into a GF grammar using dependent types; the code is in chemical-compounds/OPSIN-based. While this method may succeed given a lot more work, I suspect its granularity is far finer than we need for translating chemical names. Looking at the source, I think you will see what I mean.
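As a rough illustration of the "one big regular expression assembled from named pieces" idea, here is a small Python sketch. The fragment names and their contents are hypothetical and chosen only for this example; they are not taken from the OPSIN XML files, which hold far more pieces and far more detail.

```python
import re

# Hypothetical fragments, standing in for the pieces kept in separate files.
FRAGMENTS = {
    "locant": r"\d+(?:,\d+)*",
    "stem": r"meth|eth|prop|but|pent|hex|hept|oct",
    "substituent": r"hydroxy|chloro|methyl",
    "suffix": r"ol|one|al",
}


def build_pattern(fragments):
    """Assemble one compiled regex from the named fragments."""
    # Wrap every fragment in a non-capturing group so alternations stay contained.
    f = {name: f"(?:{body})" for name, body in fragments.items()}
    prefix = f"(?:{f['locant']}-{f['substituent']})*"
    parent = f"{f['stem']}an(?:e|-{f['locant']}-{f['suffix']})"
    return re.compile(prefix + parent)


NAME = build_pattern(FRAGMENTS)


def is_recognised(name):
    """True if the whole name matches the assembled toy pattern."""
    return NAME.fullmatch(name) is not None


print(is_recognised("7-hydroxyheptan-2-one"))
print(is_recognised("heptane"))
```

Even this toy pattern has to decide how locants, substituents, stems and suffixes fit together, which hints at how fine-grained the real set of fragments is.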