Chemical Compounds Grammar

14 March, 2012

Following the MOLTO 4th Project Meeting in Zurich, I am making available my findings so far on the Chemical Compounds grammar, to facilitate a "handover" to the UHEL team.

I have put all my code fragments in the MOLTO SVN repository under the chemical-compounds folder.

Cooke-Fox et al.

  • D. I. Cooke-Fox, G. H. Kirby, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 1. Introduction and background to a grammar-based approach,” Journal of Chemical Information and Modeling, vol. 29, no. 2, pp. 101-105, 1989.
  • D. I. Cooke-Fox, G. H. Kirby, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 2. Development of a formal grammar,” Journal of Chemical Information and Modeling, vol. 29, no. 2, pp. 106-112, 1989.
  • D. I. Cooke-Fox, G. H. Kirby, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 3. Syntax analysis and semantic processing,” Journal of Chemical Information and Modeling, vol. 29, no. 2, pp. 112-118, 1989.
  • D. I. Cooke-Fox, G. H. Kirby, M. R. Lord, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 4. Concise connection tables to structure diagrams,” Journal of Chemical Information and Modeling, vol. 30, no. 2, pp. 122-127, 1990.
  • D. I. Cooke-Fox, G. H. Kirby, M. R. Lord, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 5. Steroid nomenclature,” Journal of Chemical Information and Modeling, vol. 30, no. 2, pp. 128-132, 1990.
  • G. H. Kirby, M. R. Lord, and J. D. Rayner, “Computer translation of IUPAC systematic organic chemical nomenclature. 6. (Semi)automatic name correction,” Journal of Chemical Information and Modeling, vol. 31, no. 1, pp. 153-160, 1991.

This series of papers looks specifically at grammar-based translation of chemical names. The second paper says that the grammar the authors wrote should be attached to the paper, but however hard I looked I could not find a copy of it. I have transcribed the parts of the grammar printed in the paper into chemical-compounds/Cooke-Fox/grammar.cf, but we really need to get our hands on the original, full grammar.

Name=Struct

  • J. Brecher, “Name=Struct: A Practical Approach to the Sorry State of Real-Life Chemical Nomenclature,” Journal of Chemical Information and Modeling, vol. 39, no. 6, pp. 943-950, 1999.

Another chemical name parser, one which tries to be robust. It seems relevant, but I could not find any hard details about the system; I assume it is not open source.

AUTONOM

  • J. L. Wisniewski, “AUTONOM: system for computer translation of structural diagrams into IUPAC-compatible names. 1. General design,” Journal of Chemical Information and Modeling, vol. 30, no. 3, pp. 324-332, Aug. 1990.
  • L. Goebels, A. J. Lawson, and J. L. Wisniewski, “AUTONOM: system for computer translation of structural diagrams into IUPAC-compatible names. 2. Nomenclature of chains and rings,” Journal of Chemical Information and Modeling, vol. 31, no. 2, pp. 216-225, May 1991.

CHEMorph

  • G. Kremer, S. Anstein, and U. Reyle, “Analysing and Classifying Names of Chemical Compounds with CHEMorph,” in Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM 2006), 2006.

A parser from IUPAC names into SMILES, a compact linear notation for molecular structure; e.g. "7-hydroxyheptan-2-one" becomes OCCCCCC(=O)C.

The SMILES spec is available, for example, here: http://www.opensmiles.org/spec/open-smiles-2-grammar.html#2.2. I wrote a small CFG based on the grammar at that URL; it is in the chemical-compounds/SMILES folder.
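To make the target representation a bit more concrete, here is a minimal sketch of a recursive-descent parser for a very small SMILES subset: the atoms C, N and O, the bond symbols -, = and #, and parenthesised branches. It is not the CFG in chemical-compounds/SMILES, it ignores rings, charges, aromatic atoms and bracket atoms, and the name parse_smiles is my own invention for illustration.

```python
# Toy parser for a tiny SMILES subset (assumption: only C, N, O atoms,
# explicit bond symbols and branches; no rings, charges or bracket atoms).
BONDS = {"-": 1, "=": 2, "#": 3}
ATOMS = {"C", "N", "O"}


def parse_smiles(s):
    """Return (atoms, bonds).

    atoms is a list of element symbols in order of appearance;
    bonds is a list of (i, j, order) tuples over atom indices.
    """
    atoms, bonds = [], []
    pos = 0

    def chain(prev):
        # Parse a run of atoms and branches; prev is the index of the
        # atom that the next atom attaches to (None at the very start).
        nonlocal pos
        while pos < len(s):
            ch = s[pos]
            if ch == ")":
                return  # the caller checks that a branch was open
            if ch == "(":
                if prev is None:
                    raise ValueError("branch before first atom")
                pos += 1
                chain(prev)  # the branch hangs off the current atom
                if pos >= len(s) or s[pos] != ")":
                    raise ValueError("unclosed branch")
                pos += 1
                continue
            order = 1
            if ch in BONDS:
                if prev is None:
                    raise ValueError("bond before first atom")
                order = BONDS[ch]
                pos += 1
                if pos >= len(s):
                    raise ValueError("dangling bond")
                ch = s[pos]
            if ch not in ATOMS:
                raise ValueError(f"unsupported symbol {ch!r} at {pos}")
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, order))
            prev = idx
            pos += 1

    chain(None)
    if pos != len(s):
        raise ValueError("unbalanced ')'")
    return atoms, bonds


# 7-hydroxyheptan-2-one, with the ketone written as an explicit double bond
atoms, bonds = parse_smiles("OCCCCCC(=O)C")
print(atoms.count("C"), atoms.count("O"))
```

Even this toy version shows why SMILES is attractive as an intermediate form: the string is linear, but the parse already gives a graph (atoms plus bonds) rather than a tree.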

Lexichem

  • R. Sayle, “Foreign Language Translation of Chemical Nomenclature by Computer,” Journal of Chemical Information and Modeling, vol. 49, no. 3, pp. 519-530, 2009.

This project translates compound names by string replacement rather than with grammars. It is very relevant, but it is NOT open source, although judging by the website it may be possible to get an academic license to use it for free.
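As a rough illustration of what translation by string replacement can look like, here is a minimal sketch using greedy longest-match lookup over name fragments. This is not Lexichem's algorithm or data: the English-to-German lexicon below is a toy table I made up for the example, and the function names are hypothetical.

```python
# Toy English -> German fragment lexicon (an assumption for illustration,
# not taken from Lexichem or from the Sayle paper).
LEXICON = {
    "hydroxy": "hydroxy",
    "heptan": "heptan",
    "methan": "methan",
    "ethan": "ethan",
    "one": "on",
    "ol": "ol",
}


def capitalise_first_letter(text):
    # German nouns are capitalised; skip leading locants such as "7-".
    for i, ch in enumerate(text):
        if ch.isalpha():
            return text[:i] + ch.upper() + text[i + 1:]
    return text


def translate(name, lexicon=LEXICON):
    """Translate a name fragment by fragment, longest match first."""
    keys = sorted(lexicon, key=len, reverse=True)
    out, pos = [], 0
    while pos < len(name):
        ch = name[pos]
        if ch.isdigit() or ch in "-,()":
            out.append(ch)  # locants and punctuation pass through unchanged
            pos += 1
            continue
        for key in keys:
            if name.startswith(key, pos):
                out.append(lexicon[key])
                pos += len(key)
                break
        else:
            raise ValueError(f"no lexicon entry at position {pos}: {name[pos:]!r}")
    return capitalise_first_letter("".join(out))


print(translate("7-hydroxyheptan-2-one"))
```

The weakness of this approach is visible even here: the lexicon has no notion of what role a fragment plays in the name, so anything beyond fragment-for-fragment substitution (reordering, agreement, spacing rules) needs special-case code.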

OPSIN

  • D. M. Lowe, P. T. Corbett, P. Murray-Rust, and R. C. Glen, “Chemical name to structure: OPSIN, an open source solution,” Journal of Chemical Information and Modeling, vol. 51, no. 3, pp. 739-753, Mar. 2011.

This is a full parser for IUPAC nomenclature which is open source. There is a demo here: http://opsin.ch.cam.ac.uk/

The system is written in Java, and the parser itself is essentially built out of one massive regular expression, which is split into hundreds of pieces in various XML files (see chemical-compounds/OPSIN XML). I have made a small attempt at converting these regular expressions into a GF grammar using dependent types; the code is in chemical-compounds/OPSIN-based. This method might work given a lot more effort, but I suspect its granularity is far finer than we actually need in order to translate chemical names. Looking at the source, I think you will see what I mean.
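To give a feel for the "one big regular expression assembled from named pieces" style, here is a minimal sketch. The piece names and patterns are my own toy assumptions and are not OPSIN's actual rules or XML; they cover only a handful of simple names.

```python
import re

# Hypothetical named pieces, standing in for fragments that a real system
# would keep in separate resource files.
PIECES = {
    "locant": r"\d+(?:,\d+)*",
    "substituent": r"hydroxy|methyl|chloro",
    "stem": r"meth|eth|prop|but|pent|hex|hept",
    "unsaturation": r"an|en|yn",
    "suffix": r"one|ol|al|e",
}

# The pieces are spliced into one pattern; named groups record which piece
# matched which part of the name.
NAME = re.compile(
    (
        r"^(?:(?P<sub_locant>{locant})-)?"
        r"(?P<substituent>{substituent})?"
        r"(?P<stem>{stem})"
        r"(?P<unsaturation>{unsaturation})"
        r"(?:-(?P<suffix_locant>{locant})-)?"
        r"(?P<suffix>{suffix})$"
    ).format(**PIECES)
)


def analyse(name):
    """Return the matched pieces as a dict, or None if the name is not covered."""
    match = NAME.match(name)
    if match is None:
        return None
    return {key: value for key, value in match.groupdict().items() if value is not None}


print(analyse("7-hydroxyheptan-2-one"))
print(analyse("ethanol"))
```

Each named group here roughly corresponds to a category that a grammar would need, which is where the granularity question comes from: for translation we may only care about a few of these categories, not every distinction that a name-to-structure parser has to make.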

Comments

CML work

Let us also look at the work done by Dr Peter Murray-Rust on the Chemical Markup Language (CML).

As a personal experience: in working on the representation of the semantics of a mathematical formula (inherently a tree structure), we are still dealing with issues in going from the formula (at best a 2D representation) to its semantics. Now consider the general task which is asked for here. We are asked to infer, from a linear representation in some language, a chemical molecule, namely a 3D structure (a graph with cycles, with properties on its nodes, and so on). What do we need to know in order to do this properly? Can we do it in the time we have left in the project? (BTW, this is good news for research :) )

In mathematics we do not yet know how to get the tree (or the semantics of the formula) from the presentation, say in LaTeX. We can probably do it for well-known domains, e.g. linear algebra, where conventions have long been established in the literature. I do not know of any work on this, but I would love to see GF and SMT applied to the body of abstracts in a mathematical subject to get a semantic representation for them (I might even do this if I find time in between report writing). BTW, we started to work on this problem in 1985 and we are not there yet.