Chemical Compounds Grammar

14 March, 2012

Following the MOLTO 4th Project Meeting in Zurich, I am making available my findings so far on the Chemical Compounds grammar, to facilitate a "handover" to the UHEL team.

I have put all my code fragments in the MOLTO SVN repository under the chemical-compounds folder.

Cooke-Fox et al.

D. I. Cooke-Fox, G. H. Kirby, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 1. Introduction and background to a grammar-based approach,” Journal of Chemical Information and Modeling, vol. 29, no. 2, pp. 101-105, 1989.
D. I. Cooke-Fox, G. H. Kirby, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 2. Development of a formal grammar,” Journal of Chemical Information and Modeling, vol. 29, no. 2, pp. 106-112, 1989.
D. I. Cooke-Fox, G. H. Kirby, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 3. Syntax analysis and semantic processing,” Journal of Chemical Information and Modeling, vol. 29, no. 2, pp. 112-118, 1989.
D. I. Cooke-Fox, G. H. Kirby, M. R. Lord, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 4. Concise connection tables to structure diagrams,” Journal of Chemical Information and Modeling, vol. 30, no. 2, pp. 122-127, 1990.
D. I. Cooke-Fox, G. H. Kirby, M. R. Lord, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 5. Steroid nomenclature,” Journal of Chemical Information and Modeling, vol. 30, no. 2, pp. 128-132, 1990.
G. H. Kirby, M. R. Lord, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 6. (Semi)automatic name correction,” Journal of Chemical Information and Modeling, vol. 31, no. 1, pp. 153-160, 1991.

This series of papers looks specifically at grammar-based translation of chemical names. The second paper mentions that the grammar the authors wrote should be attached to the paper, but no matter how much I looked I could not find a copy of it. I have transcribed part of the grammar printed in the paper into chemical-compounds/Cooke-Fox/grammar.cf; however, we really need to get our hands on the original, full grammar.

Name=Struct

J. Brecher, “Name=Struct: A Practical Approach to the Sorry State of Real-Life Chemical Nomenclature,” Journal of Chemical Information and Modeling, vol. 39, no. 6, pp. 943-950, 1999.

Another chemical name parser, which tries to be robust. It seems relevant, but I could not find any hard details about the system; I assume it is not open source.

AUTONOM

J. L. Wisniewski, “AUTONOM: system for computer translation of structural diagrams into IUPAC-compatible names. 1. General design,” Journal of Chemical Information and Modeling, vol. 30, no. 3, pp. 324-332, Aug. 1990.
L. Goebels, A. J. Lawson, and J. L. Wisniewski, “AUTONOM: system for computer translation of structural diagrams into IUPAC-compatible names. 2. Nomenclature of chains and rings,” Journal of Chemical Information and Modeling, vol. 31, no. 2, pp. 216-225, May 1991.

CHEMorph

G. Kremer, S. Anstein, and U. Reyle, “Analysing and Classifying Names of Chemical Compounds with CHEMorph,” in Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM 2006), 2006.

A parser of IUPAC names into SMILES, a compact line notation that serves as a sort of canonical representation of molecular structure; e.g. "7-hydroxyheptan-2-one" becomes OCCCCCC(=O)C.

The SMILES spec is available online, for example at http://www.opensmiles.org/spec/open-smiles-2-grammar.html#2.2 (the OpenSMILES grammar). I wrote a small CFG based on the grammar at that URL; it is in the chemical-compounds/SMILES folder.
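To give a feel for what a grammar over SMILES looks like, here is a minimal sketch of a recursive-descent recogniser for a small subset of the notation: unbracketed atoms, the bonds -, = and #, and parenthesised branches. It is not the CFG in the repository and it ignores rings, charges, aromaticity and bracket atoms; the function name count_atoms and the atom table are my own illustrative choices.

```python
# Subset of the OpenSMILES grammar, simplified for illustration:
#   chain         ::= branched_atom ( bond? branched_atom )*
#   branched_atom ::= atom branch*
#   branch        ::= "(" bond? chain ")"
# Two-letter symbols come first so that "Cl" is not read as "C" + "l".
ATOMS = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")
BONDS = "-=#"


def count_atoms(smiles):
    """Parse the subset above and return a dict of atom symbol -> count.

    Raises ValueError if the string is not in the subset.
    """
    counts = {}
    pos = 0

    def atom():
        nonlocal pos
        for symbol in ATOMS:
            if smiles.startswith(symbol, pos):
                pos += len(symbol)
                counts[symbol] = counts.get(symbol, 0) + 1
                return
        raise ValueError(f"expected an atom at position {pos}")

    def branched_atom():
        nonlocal pos
        atom()
        while pos < len(smiles) and smiles[pos] == "(":
            pos += 1  # consume "("
            if pos < len(smiles) and smiles[pos] in BONDS:
                pos += 1  # optional bond at the start of a branch
            chain()
            if pos >= len(smiles) or smiles[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"

    def chain():
        nonlocal pos
        branched_atom()
        while pos < len(smiles) and smiles[pos] != ")":
            if smiles[pos] in BONDS:
                pos += 1  # optional explicit bond between atoms
            branched_atom()

    chain()
    if pos != len(smiles):
        raise ValueError(f"unexpected character at position {pos}")
    return counts


if __name__ == "__main__":
    # 7-hydroxyheptan-2-one, with the ketone written as an explicit double bond
    print(count_atoms("OCCCCCC(=O)C"))
```

Each grammar rule maps onto one nested function, which is roughly how the rules in a CFG for SMILES line up with parser code.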

Lexichem

R. Sayle, “Foreign Language Translation of Chemical Nomenclature by Computer,” Journal of Chemical Information and Modeling, vol. 49, no. 3, pp. 519-530, 2009.

This project translates compound names using string replacement rather than grammars. It is very relevant, but it is NOT open source, although according to the website it may be possible to get an academic license to use it for free.
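As a rough illustration of what translation by string replacement can look like, here is a minimal sketch that rewrites a couple of English name endings into their German forms. The rule table and the function name translate_en_to_de are my own toy assumptions and are not taken from Lexichem, which I have not been able to inspect; a real system would need a far larger rule set with many exceptions.

```python
import re

# Toy, hand-picked suffix rules (English -> German). Illustrative only.
SUFFIX_RULES = [
    (re.compile(r"one$"), "on"),  # ketones: heptan-2-one -> heptan-2-on
    (re.compile(r"ane$"), "an"),  # alkanes: hexane -> hexan
]


def translate_en_to_de(name):
    """Apply the first matching suffix rule, then capitalise the first letter."""
    for pattern, replacement in SUFFIX_RULES:
        name, replaced = pattern.subn(replacement, name)
        if replaced:
            break
    # German compound names are nouns, so the first letter is capitalised,
    # skipping over any leading locants such as "7-".
    for i, ch in enumerate(name):
        if ch.isalpha():
            return name[:i] + ch.upper() + name[i + 1:]
    return name


if __name__ == "__main__":
    print(translate_en_to_de("7-hydroxyheptan-2-one"))
    print(translate_en_to_de("hexane"))
```

The appeal of this approach is that no structural analysis of the name is needed; the cost is that every irregular form has to be listed as its own rule.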

OPSIN

D. M. Lowe, P. T. Corbett, P. Murray-Rust, and R. C. Glen, “Chemical name to structure: OPSIN, an open source solution,” Journal of Chemical Information and Modeling, vol. 51, no. 3, pp. 739-753, Mar. 2011.

This is a full parser for IUPAC nomenclature which is open source. There is a demo here: http://opsin.ch.cam.ac.uk/

The system is written in Java, and the parser itself is essentially built out of one massive regular expression, which is split into hundreds of pieces in various XML files (see chemical-compounds/OPSIN XML). I have made a small attempt at converting these regular expressions into a GF grammar using dependent types; the code is in chemical-compounds/OPSIN-based. While this method may succeed with a lot more work, I suspect its granularity is far finer than we need in order to translate chemical names. Looking at the source, I think you will see what I mean.
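To make the "one big regular expression assembled from pieces" idea concrete, here is a minimal sketch in Python rather than Java. The fragment names and patterns are invented for illustration and are not OPSIN's actual XML content; they cover only names of the shape "[N-hydroxy]<stem>an-N-<suffix>".

```python
import re

# Hypothetical fragments, standing in for the pieces spread over the XML files.
FRAGMENTS = {
    "locant": r"\d+",
    "stem": r"meth|eth|prop|but|pent|hex|hept|oct",
    "suffix": r"ol|one|al",
}

# The fragments are spliced into a single pattern; named groups record
# which piece matched which part of the name.
NAME_PATTERN = re.compile(
    r"(?:(?P<sub_locant>{locant})-hydroxy)?"
    r"(?P<stem>{stem})an-"
    r"(?P<locant>{locant})-"
    r"(?P<suffix>{suffix})".format(**FRAGMENTS)
)


def analyse(name):
    """Return the matched pieces as a dict, or None if the name does not fit."""
    match = NAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(analyse("7-hydroxyheptan-2-one"))
```

Even in this toy form, each morpheme class gets its own fragment, which hints at how fine-grained the full set of pieces becomes once the whole of the nomenclature is covered.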