Tuesday 6 July 2010

Name2Structure and Back Again, A New Adventure

A recent presentation I found online, titled "Chemical Information Mining: Possibilities and Pitfalls", caught my attention. It was given by Neil Stutchbury on 20 May 2009 at the CICAG (Chemical Information and Computer Applications Group) Scientific Text and Data Mining conference at Burlington House, London. The presentation covered four key points: extracting chemical entities from text, converting chemical names to structures, a user survey and, finally, possible future extensions.

Since I have now accepted a position at OpenEye, working primarily on Lexichem, I am particularly interested in his discussion of name to structure. You're probably wondering what comes under the umbrella of "name". Well, there is an ever-growing number of name types, too many to list in this blog post. The ones currently under discussion are: trade names, common names, natural products, abbreviations, molecular formulae and IUPAC names. A trade name, according to Wikipedia, is "the name which a business trades under for commercial purposes"; an example is "Aspirin", which is quite different from its chemical name, "acetylsalicylic acid". A common name is a name in general use in a community: caffeine is the common name for 1,3,7-trimethyl-1H-purine-2,6(3H,7H)-dione. Natural products are chemical compounds or substances produced by a living organism, such as Taxol. Abbreviations shorten a word or phrase; examples include DMSO (dimethyl sulfoxide) and THF (tetrahydrofuran). A molecular formula lists the atoms in a compound: people ordering H2O at the bar after a heavy night out are using the formula for water.
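To make the molecular formula case concrete, here is a minimal sketch of my own (not taken from any of the tools discussed in this post) that counts the atoms in a flat formula such as H2O. The class and method names are hypothetical, and the parser deliberately ignores brackets, charges and isotopes, and does not check that a symbol is a real element.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: counts atoms in a flat molecular formula such as "H2O". */
final class FormulaSketch {

    // One element symbol (capital letter, optional lowercase letter) and an optional count.
    private static final Pattern TOKEN = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    /** Returns element symbol -> atom count, in order of first appearance. */
    static Map<String, Integer> parse(String formula) {
        if (formula == null || formula.isEmpty()) {
            throw new IllegalArgumentException("Empty formula");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(formula);
        int pos = 0;
        while (pos < formula.length()) {
            // Every token must start exactly where the previous one ended.
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException(
                        "Unexpected character at index " + pos + " in " + formula);
            }
            // A missing count means a single atom.
            int n = m.group(2).isEmpty() ? 1 : Integer.parseInt(m.group(2));
            counts.merge(m.group(1), n, Integer::sum);
            pos = m.end();
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(parse("H2O"));
        System.out.println(parse("C2H6OS")); // DMSO
    }
}
```

Note that a typo such as H20 (a zero instead of the letter O) still parses, as twenty hydrogens, which is one reason formulae found in text need more checking than a sketch like this provides.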

IUPAC names are based on IUPAC nomenclature, a system of naming chemical compounds. It is developed and kept up to date by the International Union of Pure and Applied Chemistry (IUPAC). There are currently two books detailing rules for naming organic and inorganic compounds, which will be the subject of a future blog post.

When starting a new position, in any field, it is always interesting to know the competition. So who (both open and commercial) produces name to structure tools? The main suppliers and their products are listed below:

University of Cambridge -> OSCAR
ChemAxon -> Chemicalize.org
InfoChem -> Annotator and ICN2S
ChemMantis -> SureChem (NER) with ACD/Name; ACD/NTS / Batch (N2S)
CambridgeSoft -> Name=Struct
OpenEye -> Lexichem TK
Accelrys -> ChemMining
TEMIS/MDL -> Chemical Entity Relationships Skill Cartridge
MPirics -> Chemical Content Recognition


OSCAR
Having first-hand experience, coming from the University of Cambridge and Peter Murray-Rust's group, I can recommend OPSIN (the N2S module of OSCAR), which is open source and written in Java. Whilst OPSIN is showing progress, it is still in its infancy.

Chemicalize.org
Chemicalize.org, a product of ChemAxon, is hosted online. It uses the jQuery JavaScript library to provide a form into which a user can paste chemical names, or upload a small (4KB) file, to convert names to structures. ChemAxon also has its own Java API for converting names to structures and structures to names.

A simple program that reads in molecules and converts them to IUPAC and traditional names is shown below.

package eoc21;

import java.io.IOException;

import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.license.LicenseManager;
import chemaxon.license.LicenseProcessingException;
import chemaxon.marvin.calculations.IUPACNamingPlugin;
import chemaxon.marvin.plugin.PluginException;
import chemaxon.struc.Molecule;

/**
 * Converts structures to names using ChemAxon's Java API.
 * (Despite the class name, the direction here is structure to name.)
 * @author ed
 */
public class Name2Structure {

    public static void main(String[] args) throws PluginException, MolFormatException, IOException, LicenseProcessingException {
        LicenseManager.setLicenseFile("license.cxl");
        // Read molecules one at a time from a SMILES file.
        MolImporter mi = new MolImporter("150mols.smi");
        try {
            IUPACNamingPlugin plugin = new IUPACNamingPlugin();
            Molecule mol;
            while ((mol = mi.read()) != null) {
                plugin.setMolecule(mol);
                plugin.run();
                String preferredIUPACName = plugin.getPreferredIUPACName();
                String traditionalName = plugin.getTraditionalName();
                // Print both names, tab-separated, one molecule per line.
                System.out.println(preferredIUPACName + "\t" + traditionalName);
            }
        } finally {
            // Release the input file even if naming fails part-way through.
            mi.close();
        }
    }
}


Annotator
This InfoChem product extracts chemically relevant entities from text and converts the names to structures. It supports systematic names, trivial names and trade names, as well as InChIs and CAS Registry Numbers.
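CAS Registry Numbers are a nice case because they carry a built-in check digit, so a candidate pulled out of text can be sanity-checked before any lookup. The sketch below is my own illustration of that checksum (I do not know whether Annotator works this way); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: validates the check digit of a CAS Registry Number. */
final class CasNumberSketch {

    // Format: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
    private static final Pattern CAS = Pattern.compile("(\\d{2,7})-(\\d{2})-(\\d)");

    static boolean isValid(String cas) {
        Matcher m = CAS.matcher(cas);
        if (!m.matches()) {
            return false;
        }
        String body = m.group(1) + m.group(2);
        int sum = 0;
        // Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
        for (int i = 0; i < body.length(); i++) {
            int digit = body.charAt(body.length() - 1 - i) - '0';
            sum += digit * (i + 1);
        }
        // The weighted sum modulo 10 must equal the final (check) digit.
        return sum % 10 == m.group(3).charAt(0) - '0';
    }

    public static void main(String[] args) {
        System.out.println(isValid("50-78-2"));   // aspirin
        System.out.println(isValid("7732-18-5")); // water
        System.out.println(isValid("50-78-3"));   // wrong check digit
    }
}
```

A passing checksum only says the number is well formed; it does not confirm that the number has actually been assigned to a substance.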

ChemMantis
An integrated system for chemistry-based entity extraction and document mark-up, enabling access to the rich online chemistry resource known as ChemSpider.

Name=Struct
CambridgeSoft's Name=Struct supports both name to structure and structure to name conversions. It handles charged compounds and salts, highly symmetric structures and many other types of inorganic and organometallic compounds.

Lexichem TK
A C++ toolkit provided by OpenEye that converts names to structures and structures to names, and can also translate names into different languages such as Japanese, Romanian and Hungarian. An example of name to structure code, taken from the Lexichem documentation, is given below:

#include "openeye.h"
#include "oeplatform.h"
#include "oesystem.h"
#include "oechem.h"
#include "oeiupac.h"
#include "nam2mol_example.itf"

using namespace OEPlatform;
using namespace OESystem;
using namespace OEChem;
using namespace OEIUPAC;
using namespace std;

#ifndef STDIN_FILENO
#define STDIN_FILENO 0
#endif

int main(int argc, char *argv[])
{
  OESetMemPoolMode(OEMemPoolMode::SingleThreaded|OEMemPoolMode::UnboundedCache);
  OEThrow.Info("Lexichem nam2mol example");
  OEThrow.Info(" Lexichem version: %s", OEIUPACGetRelease());
  OEInterface itf(InterfaceData, argc, argv);

  // Open the input: "-" means read names from stdin, otherwise from a file.
  oeifstream infile;
  string inname = itf.Get<std::string>("-in");
  if (inname == "-")
  {
    if (!infile.openfd(STDIN_FILENO, true)) // read from stdin
      OEThrow.Fatal("Unable to read from stdin");
  }
  else
  {
    if (!infile.open(inname))
      OEThrow.Fatal("Unable to open input file: %s\n", inname.c_str());
  }

  oemolostream outfile;
  if (!outfile.open(itf.Get<std::string>("-out")))
    OEThrow.Fatal("Unable to create output file: %s\n",
                  itf.Get<std::string>("-out").c_str());

  unsigned int language = OEGetIUPACLanguage(itf.Get<std::string>("-language"));

  OEGraphMol mol;
  char buffer[8192];
  bool done;
  // Each input line holds one chemical name.
  while (infile.getline(buffer, 8192))
  {
    mol.Clear();
    // Speculatively reorder CAS permuted index names
    std::string str = OEReorderIndexName(buffer);
    if (str.empty()) str = buffer;

    // Translate non-American names before parsing.
    if (language != OELanguage::AMERICAN)
    {
      str = OEFromUTF8(str.c_str());
      str = OELowerCaseName(str.c_str());
      str = OEFromLanguage(str.c_str(), language);
    }

    done = OEParseIUPACName(mol, str.c_str());
    // With -empty, names that fail to parse are written as empty molecules.
    if (!done && itf.Get<bool>("-empty"))
    {
      mol.Clear();
      done = true;
    }

    if (done)
    {
      if (itf.Has<std::string>("-tag"))
        OESetSDData(mol, itf.Get<std::string>("-tag"), buffer);
      mol.SetTitle(buffer);
      OEWriteMolecule(outfile, mol);
    }
  }
  return 0;
}


ChemMining
This Accelrys software has the advantage that it can be incorporated into Pipeline Pilot workflows. ChemMining extracts chemical entities from documents, including chemical names, IUPAC names, SMILES strings, and common and brand names.

Chemical Entity Relationships Skill Cartridge
Developed by January 2006, this cartridge identifies chemical compound names, chemical classes and molecular formulae in text documents and translates extracted information into the chemist’s language: the chemical structure. The software includes the ability to identify chemical terms and assign them to specific chemical concepts according to semantic categories (another area of interest, the semantic web and chemistry, which I will blog about later in the year).
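To give a feel for the "find a name in text, then translate it into a structure" idea, here is a toy dictionary-based tagger of my own. It is not how the TEMIS cartridge (or any product above) works, as far as I know: the class name, the `Hit` type and the seven-entry dictionary are all hypothetical, and real systems rely on grammars and far larger lexicons.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Illustrative only: tags known chemical names in text and looks up a SMILES string. */
final class DictionaryTaggerSketch {

    // Toy dictionary (lower-case name -> SMILES), invented for this example.
    private static final Map<String, String> NAME_TO_SMILES = new LinkedHashMap<>();
    static {
        NAME_TO_SMILES.put("aspirin", "CC(=O)Oc1ccccc1C(=O)O");
        NAME_TO_SMILES.put("acetylsalicylic acid", "CC(=O)Oc1ccccc1C(=O)O");
        NAME_TO_SMILES.put("caffeine", "Cn1cnc2c1c(=O)n(C)c(=O)n2C");
        NAME_TO_SMILES.put("dmso", "CS(=O)C");
        NAME_TO_SMILES.put("dimethyl sulfoxide", "CS(=O)C");
        NAME_TO_SMILES.put("thf", "C1CCOC1");
        NAME_TO_SMILES.put("tetrahydrofuran", "C1CCOC1");
    }

    // One case-insensitive alternation over all names, longest names first.
    private static final Pattern NAMES = Pattern.compile(
            "\\b(?:" + NAME_TO_SMILES.keySet().stream()
                    .sorted(Comparator.comparingInt(String::length).reversed())
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|")) + ")\\b",
            Pattern.CASE_INSENSITIVE);

    /** A matched name, its character offset in the text, and its structure. */
    static final class Hit {
        final int start;
        final String text;
        final String smiles;

        Hit(int start, String text, String smiles) {
            this.start = start;
            this.text = text;
            this.smiles = smiles;
        }

        @Override
        public String toString() {
            return text + "@" + start + " -> " + smiles;
        }
    }

    /** Returns every dictionary name found in the text, in order of appearance. */
    static List<Hit> tag(String text) {
        List<Hit> hits = new ArrayList<>();
        Matcher m = NAMES.matcher(text);
        while (m.find()) {
            String key = m.group().toLowerCase(Locale.ROOT);
            hits.add(new Hit(m.start(), m.group(), NAME_TO_SMILES.get(key)));
        }
        return hits;
    }

    public static void main(String[] args) {
        for (Hit hit : tag("Aspirin was dissolved in DMSO, not THF.")) {
            System.out.println(hit);
        }
    }
}
```

A pure dictionary like this can never handle a systematic name it has not seen before, which is exactly the gap that name to structure parsers are meant to fill.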

Chemical Content Recognition
This is a service delivered by MPirics. It allows chemists to extract chemical entities from text, with patents as the main target.

1 comment:

  1. Congratulations on your new job!

    If you try OPSIN I hope you will find that at least in some areas it is now comparable to the commercial offerings. Performance has improved considerably over the previous year.

    OPSIN's API is:
    NameToStructure nts = NameToStructure.getInstance();
    NameToStructureConfig n2sconfig = new NameToStructureConfig();
    OpsinResult result = nts.parseChemicalName(name, n2sconfig);
    Element cml = result.getCml();
    String inchi = NameToInchi.convertResultToInChI(result, false);

    or if you don't want to configure anything it can be as simple as:
    Element cml = nts.parseToCML(name);

    where name is the chemical name as a string.

    If you run into any chemical nomenclature problems feel free to drop me an email
