Sunday, 5 December 2010

Edit & View Molecules on the go!

Having newly acquired an iPad, the first thing I did was download some useful apps for it that I already have on my iPhone. I then began to look for cheminformatics apps that might be useful as part of my work, and came across "Mobile Molecular Datasheets". This is an app that allows a user to view and edit chemical structures on the iPad. More details can be found on the app's web page. The app is developed by Molecular Materials Informatics and was last updated on the 2nd of December 2010. It is supported on iOS 4.2+ and is a 1.9 MB download. It does, however, cost £8.99 (so not the cheapest of apps).
The molecules are organized into a collection of data sheets, which can be shared or sent via email. The formats currently supported are MOL and SDF files. The app also has the advantage that it can be integrated into workflows.

Mobile Molecular Datasheets has support for:
1. New data sheets of molecules.
2. New reactions.
3. New molecule templates. Examples of templates include: large rings, crown ethers, multi-dentate ligands, cage structures, amino acids, cage complexes, biomolecules and saccharides.
4. A web service.

If anyone has any experience with additional cheminformatics iPad or iPhone apps, it would be great to hear from you.

Thursday, 28 October 2010

Redirecting Eclipse Output to a file

Whilst I think this tip is quite fundamental, it's worth noting. Being able to direct output from the Eclipse console to a file is particularly useful when a program you have little or no control over generates a lot of warnings, the output is rather large, and the start of it scrolls off the console.

To redirect to a file, simply:
1. Go to the Run menu.
2. Select Run -> Run Configurations.
3. Go to the Common tab.
4. Under Standard Input and Output, tick File and specify the file to write to.
5. The console output will then be written to that file.

Sunday, 17 October 2010

The Future Is Bright, The Future is Orange

A recent post by Richard L. Apodaca on the use of KNIME workflows in Eclipse for cheminformatics provoked me to look at another piece of software, Orange. Orange has been around for some time, and is an open source data visualization and mining toolkit written in the Python programming language. The GUI is built on Qt.

I recently downloaded the Mac OS X bundle, and was pleasantly surprised by the ease with which workflows could be created (see attached image). Using the Orange GUI is easy: it allows you to read in files of different formats, process or filter attributes, cleanly visualize the data and data distributions, classify data, show confusion matrices and ROC curves, etc.

Since I am a big fan of Eclipse, I wanted to access the scripting side of the Orange library through Eclipse. Setting up a Pydev project is easy, however, when I came to run my program:

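'''
Created on Oct 16, 2010
Example of using orange python -> constructs Naive Bayesian Classifier
@author: eoc21
'''
import os, sys, orange

class ClassifierExample():
    def __init__(self,fileName):
        # Load the data set. The attribute name self.data is an assumption;
        # the original name was lost from the listing.
        self.data = orange.ExampleTable(fileName)
        # Train a naive Bayes classifier on the examples from index 2 onwards.
        self.classifier = orange.BayesLearner(self.data[2:])

    def runBayesLearner(self):
        # Compare the stored class with the predicted class for examples 2-19.
        for i in range(2,20):
            c = self.classifier(self.data[i])
            print "original", self.data[i].getclass(), "classified as", c

    def printProbabilities(self):
        # Print the predicted probability of the class at index 1.
        for i in range(2,20):
            p = self.classifier(self.data[i],orange.GetProbabilities)
            print "%d: %5.3f (originally %s)" % (i+1, p[1], self.data[i].getclass())

if __name__ == '__main__':
    example = ClassifierExample(sys.argv[1])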

I came up against the error "can't work with 64 bit architecture". Since I'm running Snow Leopard, which defaults to 64 bits, I had to set an environment variable called VERSIONER_PYTHON_PREFER_32_BIT to yes.

Everything then worked cleanly.

Orange is primarily for machine learning, however it also has tools to support workflows in bioinformatics. One can also use the molecule visualizer to view SMILES strings from a file.

Sunday, 3 October 2010

CML - who uses it?

I was very intrigued to hear from a colleague and fellow developer that CML (Chemical Markup Language) is apparently not very widely used in the field of cheminformatics. Is this true only of industry and private companies? Does academia (other than the Murray-Rust group and Henry S. Rzepa) relish this format? I would be interested to hear people's views.

A search of the Journal of Chemical Information and Modeling for "Chemical Markup Language" retrieved 76 hits, of which 17 papers include either PMR or Henry S. Rzepa (approx. 22.4%).

If CML is not the chosen chemical format, what is the predominant format? SMILES, InChI, InChIKey, SDF, MOL2, etc.? What will be the predominant format of the future? RDF, OWL?

Thursday, 2 September 2010

Core Cheminformatics Competencies

One article I found online that interested me was "10 Useful Bioinformatics Skills to Have". Clearly all ten points apply to a professional in the field of cheminformatics. However, it would be interesting to know which qualities are most highly valued by employers.

For this purpose, I have reviewed the jobs advertised on the Computational Chemistry List from the 19th of February 2010 to the 1st of September 2010. Alternative job sources could have been found, I agree, however the site is fairly representative of the types of jobs people having finished a degree in computational chemistry may consider.

Have Your Cake and Eat It
The pie chart above shows the distribution of the 125 cheminformatics jobs by country. Clearly the cheminformatics job market is dominated by the USA, which took just over 50% of all the jobs advertised on the list over the last 8 months. Germany takes silver on the podium with 16 positions available, whilst the UK takes bronze with 10.

What skills are employers seeking?
Well, this is a good question, now that we've established where the main demand for cheminformatics jobs is.

The top ten skills expected by an employer are:
1. PhD (75).
2. Experience in Molecular Dynamics (36).
3. Experience in programming (34).
4. Simulation experience (29).
5. Strong oral communication skills (26).
6. Strong written communication skills (26).
7. Experience with Linux (22).
8. Python programming (22).
9. Team player (17).
10. Experience with docking software (17).

For the more interested reader, I have given a more detailed breakdown of the results in the table below.

Linux 22
Molecular Dynamics 36
Programming Language 34
Organization skills 2
Oral Communications 26
Written communications 26
Working in a team 17
PhD 75
Simulation Experience 29
Docking 17
Python 22
R 5
Java 13
C 16
C++ 16
Tcl 1
C# 2
Ruby 1
Perl 8
Fortran 7
Matlab 4
Database Knowledge 5
Molecular Modeling 18
Virtual screening 11
Genomics 3
Pharmacophore Elucidation 5
Quantum Mechanics 19
Conformational Analysis 1
Homology Modeling 8
Ligand based design 7
Structure based design 9
Super computing experience 3
Parallel programming 3
Experience in force field development 3
Willingness to travel 3
Scripting 10
Workflow experience 3
Machine Learning 4
Chemogenomics 2
Semantic Web technologies 1
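
To put the raw counts in context, they can be expressed as a share of the 125 adverts reviewed. The sketch below does this for a handful of rows copied from the table; the dictionary and function names are purely illustrative.

```python
# Counts copied from the table above; 125 is the number of adverts reviewed.
TOTAL_ADVERTS = 125

skill_counts = {
    "PhD": 75,
    "Molecular Dynamics": 36,
    "Programming Language": 34,
    "Simulation Experience": 29,
    "Python": 22,
    "Linux": 22,
}


def share_of_adverts(count, total=TOTAL_ADVERTS):
    """Return the percentage of adverts that mention a skill."""
    return 100.0 * count / total


if __name__ == "__main__":
    # Sort by count, largest first, and print one line per skill.
    for skill, count in sorted(skill_counts.items(), key=lambda kv: -kv[1]):
        print("%-22s %3d  %5.1f%%" % (skill, count, share_of_adverts(count)))
```

On these numbers a PhD is asked for in 60% of the adverts, and molecular dynamics experience in just under 29%.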

Thursday, 19 August 2010

Boston ACS Fall 2010

I am just preparing the last few tweaks to my poster before I present it in the CINF (Chemical Information) Scholarship for Scientific Excellence session on Sunday the 22nd of August. This should be an exciting event; there are 8 other posters. The posters range in content from semantic web applications through to toxicity prediction and virtual screening.

With regards to the rest of the program, unfortunately I cannot stay past Monday. Some talks that I would have liked to see (and may not get to see on Sunday) include:

In the general papers

1. #84 Chemistry in your hand, by Dr Anthony J. Williams (ChemSpider).

2. #86 Extracting information from the IUPAC Green Book, by Prof Jeremy G Frey from the University of Southampton.

Data Intensive Drug Design

1. #12 Public-domain data resources at the European Bioinformatics Institute and their use in drug discovery, by Christoph Steinbeck.

2. #16 Data drive life sciences: The Pyramids meet the Tower of Babel, by Dr. Rajarshi Guha, NIH.

Recent Progress in Chemical Structure Representation

1. #67 Recent IUPAC recommendations for chemical structure representation: An overview, by Mr. Jonathan Brecher, CambridgeSoft.

2. #69 Line notations as unique identifiers, by Krisztina Boda PhD.

There are also a number of presentations from the Semantic Web in Chemistry division.

#4 Chemical e-Science Information Cloud (ChemCloud): A semantic web based eScience infrastructure, by Prof. Dr. Adrian Paschke, FIZ Chemie, Berlin.

#36 ChemicalTagger: A tool for semantic text-mining in chemistry, by Dr Lezan Hawizy, University of Cambridge.

Sunday, 8 August 2010

Wrapping it all up

Since I'm going to be dealing with quite a lot of C++ code, it's interesting to know how one could use this functionality from a different programming language, all in the cozy environment of one's own living room: the Eclipse IDE.

To get started, all we need are three ingredients: Eclipse, SWIG and the SWIG plugin for Eclipse called sKWash. For those unfamiliar with SWIG, according to their website, it is

"a software development tool that connects programs written in C and C++ with a variety of high-level programming languages."

Essentially you take some C or C++ code,

/* File : example.c */

#include <time.h>
double My_variable = 3.0;

int fact(int n) {
    if (n <= 1) return 1;
    else return n*fact(n-1);
}

int my_mod(int x, int y) {
    return (x%y);
}

char *get_time()
{
    time_t ltime;
    time(&ltime);
    return ctime(&ltime);
}

write an interface file

/* example.i */
%module example
%{
/* Put header files here or function declarations like below */
extern double My_variable;
extern int fact(int n);
extern int my_mod(int x, int y);
extern char *get_time();
%}

extern double My_variable;
extern int fact(int n);
extern int my_mod(int x, int y);
extern char *get_time();

build the module, i.e. for python:

unix % swig -python example.i
unix % gcc -c example.c example_wrap.c \
       -I<path to your Python include directory>
unix % ld -shared example.o example_wrap.o -o _example.so

We can now use the Python module as follows :

>>> import example
>>> example.fact(5)
120
>>> example.my_mod(7,3)
1
>>> example.get_time()
'Sun Feb 11 23:01:07 1996'

This, however, involves the command line. The advantage of sKWash is that it provides a GUI to SWIG through Eclipse. An example of its usage is shown here, which can of course be extended to more complicated code such as the conversion of whole libraries, e.g. QuantLib, a C++ library for quantitative finance. The library can then be used from the desired programming language.

In brief, you create two C++ projects in Eclipse: one to store your main code, the other as a container for the C++ wrapper generation. Then create a Java (or other target language) project in Eclipse; this acts as the container for the wrapper code in the target language. In the final stage, create a sKWash project and specify the C++ wrapper generation project and the target language project. Eclipse will then generate the interface files and build the project. The wrapper code should now be displayed in your wrapper projects.

Sunday, 1 August 2010

Mash it all together using Yahoo! Pipes

I came across Yahoo! Pipes completely by chance. It is essentially a web application provided by Yahoo! for building data mashups from web feeds, web pages and other web services, with a GUI to simplify the process.

One application of Yahoo! Pipes that I found of interest was built by Sara Nichols, a postdoc in the McCammon group, to aggregate data on jobs taken from the Computational Chemistry Jobs Board.

She has created a pipe capable of producing either RSS feeds or JSON that can be incorporated into webpages.
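
To give a feel for what consuming such a feed looks like, here is a minimal Python sketch that reads RSS items and re-emits them as JSON. The sample feed, its titles and its example.org links are invented stand-ins, not the output of the actual pipe, and the function names are my own.

```python
import json
import xml.etree.ElementTree as ET

# A made-up two-item feed standing in for the real pipe output.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example jobs feed</title>
    <item>
      <title>Postdoc in molecular dynamics</title>
      <link>http://example.org/jobs/1</link>
    </item>
    <item>
      <title>Cheminformatics developer</title>
      <link>http://example.org/jobs/2</link>
    </item>
  </channel>
</rss>"""


def rss_items(rss_text):
    """Return a list of {'title', 'link'} dictionaries, one per <item>."""
    root = ET.fromstring(rss_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


def rss_to_json(rss_text):
    """Serialise the feed items as a JSON array."""
    return json.dumps(rss_items(rss_text), indent=2)


if __name__ == "__main__":
    print(rss_to_json(SAMPLE_RSS))
```

The JSON form is the one that is easiest to drop into a web page, since JavaScript can consume it directly.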

Saturday, 31 July 2010

This may seem obvious

For some, this may be obvious and straightforward to implement. However, I am astounded by the number of times I have been asked for a script to transform integer based fingerprints, such as MACCS keys or pharmacophore fingerprints coming out of MOE (the Molecular Operating Environment), into binary representations such as 000110101, so that molecules can be compared by their Tanimoto similarity (or some other measure) or fed into a machine learning algorithm.

So here is a useful script that will convert integer based fingerprints into binary fingerprints.

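import os, sys

class ConvertIntegerFPToBinary():
    def __init__(self,descriptorFile,outputFile,label):
        self.iFile = descriptorFile
        self.oFile = outputFile
        self.binaryFingerprints = []
        self.integerList = []
        self.label = label

    def populateList(self,maxBitSize):
        # Assumed reconstruction (loop body lost): keys 1..maxBitSize as strings.
        for i in range(1,maxBitSize+1):
            self.integerList.append(str(i))

    def convertData(self,maximumBitSize):
        self.populateList(maximumBitSize)
        inputFile = open(self.iFile,'r')
        data = inputFile.readlines()
        inputFile.close()
        # Skip the header line, then build one bit list per molecule.
        for i in range(1,len(data)):
            splitdata = str(data[i]).replace("\"", "").split()
            binaryFingerprint = []
            for j in range(len(self.integerList)):
                # Assumed reconstruction: 1 if the key is present, otherwise 0.
                if self.integerList[j] in splitdata:
                    binaryFingerprint.append('1')
                else:
                    binaryFingerprint.append('0')
            self.binaryFingerprints.append(binaryFingerprint)
        return self.binaryFingerprints

    def postProcessing(self,fp):
        # Strip the list syntax, leaving comma separated bits.
        binaryFingerprint = str(fp).replace("[", "").replace("]", "").replace("'", "")
        return binaryFingerprint

    def writeToFile(self):
        outputFile = open(self.oFile,'w')
        for i in range(len(self.binaryFingerprints)):
            processedFingerprint = self.postProcessing(str(self.binaryFingerprints[i]))
            # Assumed reconstruction: append the label, no newline after the last row.
            if i == len(self.binaryFingerprints)-1:
                outputFile.write(processedFingerprint + ", " + self.label)
            else:
                outputFile.write(processedFingerprint + ", " + self.label + "\n")
        outputFile.close()

if __name__ == '__main__':
    converter = ConvertIntegerFPToBinary(sys.argv[1],sys.argv[2],sys.argv[3])
    binaryFingerprints = converter.convertData(166)
    converter.writeToFile()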
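
Once the fingerprints are in binary form, the Tanimoto comparison mentioned above is only a few lines. The following is a small self-contained sketch rather than part of the script: the helper names to_bits and tanimoto are my own, and the example keys are made up.

```python
def to_bits(on_keys, size=166):
    """Turn a list of 1-based key numbers that are set into a 0/1 list."""
    on = set(on_keys)
    return [1 if key in on else 0 for key in range(1, size + 1)]


def tanimoto(a, b):
    """Tanimoto coefficient: bits set in both / bits set in either."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two empty fingerprints share nothing, so report 0.0 rather than divide by zero.
    return float(both) / either if either else 0.0


if __name__ == "__main__":
    fp1 = to_bits([1, 3, 5], size=8)
    fp2 = to_bits([3, 5, 7], size=8)
    print(fp1, fp2, tanimoto(fp1, fp2))
```

In the example the two fingerprints share keys 3 and 5 out of four distinct keys set overall, so the similarity is 2/4.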

Wednesday, 28 July 2010

Protection, Protection, Protection!

A long long time ago when I can still remember (well I think it was about 3 years ago), I set up a MediaWiki on my website (installation instructions are here). Whilst I was reaping the benefits of having a nice Web 2.0 interactive site, with ease of adding new content and web pages, something more sinister was lurking in the background. It was spam. Because there were few extensions at this time to prevent spam I thought it wouldn't happen to me. I was later told by my server administrator that I had a few gigs of spam on my wiki that was slowing down the whole server.

After removing the last wiki, I have now bravely decided to install it again (fortune favours the bold). This time round however, the first thing I have done is install all the possible extensions to combat spam. There is an exhaustive list of extensions here. There are a number of ways that one can restrict spam:

1. Restrict reading of the wiki, by setting $wgGroupPermissions['*']['read'] = false;
2. Prevent account creation, except by the administrator $wgGroupPermissions['*']['createaccount'] = false;
3. Restrict anonymous editing by setting $wgGroupPermissions['*']['edit'] = false;
4. Admin can also use the 'protect' facility to restrict access to certain pages.
5. Removing the login link / create account link using the following function:

function NoLoginLinkOnMainPage( &$personal_urls ){
    unset( $personal_urls['login'] );
    unset( $personal_urls['anonlogin'] );
    return true;
}
$wgHooks['PersonalUrls'][] = 'NoLoginLinkOnMainPage';

6. Install a CAPTCHA extension (ConfirmEdit)
Because edits have to be confirmed, most bots will not be able to make edits to the wiki. This extension is simple to install: download the PHP file and add require_once( "$IP/extensions/ConfirmEdit/ConfirmEdit.php" ); to your LocalSettings.php file, like all the other additions shown above.

Now that my wiki is much more secure I thought I would also add a few more extensions. Clearly it is useful to have your calendar displayed without going to say your gmail all the time. MediaWiki has an extension for this here. Just download the extension then go to this link to get details on your calendar and copy and paste all information between "googlecalendar" tags, ignoring all comments before the '?' and removing the "iframe" tag at the end.

I have also installed the chemistry extension, which works, but in my opinion is still at rather a beta stage of development. It allows you to write chemical formulas into your wiki between "chemform" tags, and it automatically corrects the sub- and superscripting of atom counts and charges. See here for examples. One of the problems I have found with this extension, however, is that if you have, say, dioxygen with a single negative charge and enter O2-, it actually displays the molecule as though it has one oxygen atom with a 2- charge. Clearly some work will be needed on this extension, but it's interesting to see freely available chemistry extensions coming out of the woodwork for MediaWiki.

Thursday, 22 July 2010

Python makes the REST look easy

Web services are becoming an increasingly common phenomenon in the field of chemoinformatics, where data and services are now being published more openly on the internet. For those unfamiliar with the concept of web services, Wikipedia spells it out to be "application programming interfaces (API) or web APIs that are accessed via Hypertext Transfer Protocol (HTTP) and executed on a remote system hosting the requested services."

Traditionally people used SOAP web services, however they have waned in popularity relative to a new paradigm of web service called REST, which stands for Representational State Transfer. More information on RESTful web services can be found at the above link.

I would just like to illustrate, with some code, how simple it is to access RESTful web services using Python. The web services we will be accessing are chemoinformatics web services running on a server at Indiana University under the CHEMBIOGrid project.

The two services we will be accessing are:
1. 3D structure for a PubChem compound, retrieved from the Pub3D database
2. Generation of a 3D structure from a SMILES string, using the smi23d program

The code to access these services is shown below.

'''
Created on Jul 22, 2010
Program to access a REST web service hosted at Indiana University
@author: ed
'''
import urllib2, os, sys

class ChemoinformaticsRESTWebServices():
    def __init__(self,url):
        self.url = url

    def callWS(self,parameter):
        # Append the parameter to the base URL and print the response body.
        url = self.url+parameter
        try:
            data = urllib2.urlopen(url).read()
            print data
        except urllib2.HTTPError, e:
            print "HTTP error: %d" % e.code
        except urllib2.URLError, e:
            print "Network error: %s" % e.reason.args[1]

if __name__ == '__main__':
    #3D structure for a PubChem compound, retrieved from the Pub3D database
    ws = ChemoinformaticsRESTWebServices("")
    #Generation of a 3D structure from a SMILES string, using the smi23d program
    ws2 = ChemoinformaticsRESTWebServices("")
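
One detail worth illustrating when the parameter is a SMILES string: characters such as '#' and '=' have special meaning in a URL, so the parameter is best percent-encoded before it is appended. The sketch below uses Python 3's urllib.parse (the listing above is Python 2) and only builds the URL without contacting any server. The base address http://example.org/smi23d is a placeholder of my own, not the real service URL, and build_request_url is a hypothetical helper.

```python
from urllib.parse import quote


def build_request_url(base_url, parameter):
    """Append a percent-encoded parameter to a REST base URL."""
    if not base_url.endswith("/"):
        base_url += "/"
    # safe="" makes quote() encode '/', too, so the SMILES stays one path segment.
    return base_url + quote(parameter, safe="")


if __name__ == "__main__":
    # Hydrogen cyanide: the triple bond '#' would otherwise start a URL fragment.
    print(build_request_url("http://example.org/smi23d", "C#N"))
```

The resulting string can then be passed to whatever HTTP client you prefer.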

Wednesday, 21 July 2010


Not sure of the title? Well, if you knew Japanese you'd be OK. For many of us Japanese is not the easiest of languages to pick up or understand; as an English speaker, I haven't got a clue what it means. Other than picking up a dictionary or attending some Japanese classes, you can use Google Translate, which has the advantage that it can convert between a multitude of different languages. If the suspense is killing you, the title actually means "Benzene".

I would be interested to see how well Google Translate works on other chemical names. If you're interested in converting say a batch load of names, there are Java and Python APIs available to access the web service.

Converting the name Benzene in English to its Romanian counterpart Benzen, works correctly.

Taking something more elaborate like: 13-amino-N-(2-{2-[(2-{[2-(2-aminoethyl)amino]ethyl}amino)ethyl]amino}ethyl)-

returns -> 13-amino-N-(2 - (2 - [(2 - ([2 - (2-amino) etil] amino) amino) etil] amino) etil) -

which I think is also right (with the exception of the loss in curly brace notation), though I will have to check.

Example Java code to translate

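package eoc21;

// Assumed imports: the google-api-translate-java library.
import com.google.api.translate.Language;
import com.google.api.translate.Translate;

public class TranslateChemicalNames {

    public static void main(String[] args) {
        try {
            String translatedText = Translate.translate("tri-oxa-tri-silinane",
                    Language.ENGLISH, Language.ROMANIAN);
            System.out.println(translatedText);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}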

Saturday, 17 July 2010

Ultrafast Shape Recognition

A few days ago, I was asked by a colleague who is working in the field of QSAR and virtual screening, to write an implementation of Pedro Ballester's Ultrafast Shape Recognition (USR) descriptor using the Python programming language.

Ballester's descriptor is a fast way to find molecules that closely resemble leads based on their shape. It has been shown to avoid the alignment problem, and to be up to 1500 times faster to calculate than other current methodologies. The shape descriptor makes the assumption that a molecule's shape can be uniquely defined by the relative position of its atoms and that three-dimensional shape can be characterised by one-dimensional distributions.

The source code, along with HTML API documentation, can be found on my GitHub. I hope this is of use to people. I have also uploaded a trial dataset of A42731 (Substance P Antagonists) and the resultant USR descriptor file.
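
For readers who want the gist without opening the repository, here is a compact pure-Python sketch of the idea. It is not the code from my GitHub. It assumes the usual USR recipe: four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one), three moments of each distance distribution, and a similarity score based on the mean absolute difference of the 12 values. Published variants differ in exactly which moments they use; here I take the mean, the standard deviation and the cube root of the third central moment.

```python
import math


def _dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _moments(values):
    """Mean, standard deviation and signed cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, math.sqrt(var), skew]


def usr_descriptor(coords):
    """Return the 12-number shape descriptor for a list of (x, y, z) atoms."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [_dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # closest atom to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # farthest atom from the centroid
    d_fct = [_dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # farthest atom from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        descriptor.extend(_moments([_dist(c, ref) for c in coords]))
    return descriptor


def usr_similarity(a, b):
    """Score in (0, 1]; 1 means the two descriptors are identical."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    mol = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0),
           (0.0, 1.5, 1.0), (3.0, 0.5, 0.2)]
    moved = [(x + 10.0, y - 4.0, z + 2.5) for x, y, z in mol]
    print(usr_similarity(usr_descriptor(mol), usr_descriptor(moved)))
```

Because only interatomic distances are used, translating (or rotating) a molecule leaves the descriptor unchanged, which is why no alignment step is needed.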

Sunday, 11 July 2010

Designing a website using the Google Web Toolkit

Since I'm now going to be coding predominantly in C++, I thought it would be nice to keep the Java going and give my website a new face. Whilst I'm not going to compare different JavaScript frameworks, I will give a few pointers as to why I chose the Google Web Toolkit (GWT).

  1. Firstly, I have previously developed a semantic web application using the GWT.
  2. Secondly, I have a stronger background in Java programming than in JavaScript. GWT allows you to write AJAX applications in Java and then compile the source to highly optimized JavaScript that runs across all browsers.
  3. Thirdly, because it's Java I can use my favourite IDE, Eclipse. Eclipse has a large number of tools to assist the developer, and it has the advantage that it comes with a GWT plugin, which can create projects seamlessly. Eclipse also supports a number of version control systems (CVS, SVN, Git), necessary for any software project.

So if one were to develop a website or an online application with GWT and Eclipse, and host it on, say, GitHub, how would one go about doing that?
Just follow these steps:

  1. Download Eclipse from
  2. Add the GWT plugin. The appropriate urls are:
Eclipse 3.6 (Helios)
Eclipse 3.5 (Galileo)
Eclipse 3.4 (Ganymede)
Eclipse 3.3 (Europa)

3. Restart your Eclipse to make sure the workspace is refreshed.

4. To create a new GWT project, simply click on the blue and white GWT project icon, or go to File -> New -> Project -> Google -> Web Application Project.
Click on the Next button and you will be prompted for a project name and package. You will also be given the option to specify the SDK you want to use; in my case I have installed both GWT 1.7.0 and GWT 2.0.4. I would suggest upgrading to the newest version, as from the GWT 2.0 milestone they no longer provide individual operating system .jar files. Previously I had a gwt-dev-linux.jar for my Ubuntu machine, however the distributable .jar is now called gwt-dev.jar.

This should produce a project structure similar to the one on my GitHub.

Code for the main class has been copied in here:

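package eoc21.client;

import eoc21.shared.FieldVerifier;
import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.core.client.GWT;
import com.google.gwt.event.dom.client.ClickEvent;
import com.google.gwt.event.dom.client.ClickHandler;
import com.google.gwt.event.dom.client.KeyCodes;
import com.google.gwt.event.dom.client.KeyUpEvent;
import com.google.gwt.event.dom.client.KeyUpHandler;
import com.google.gwt.user.client.rpc.AsyncCallback;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.DialogBox;
import com.google.gwt.user.client.ui.HTML;
import com.google.gwt.user.client.ui.Label;
import com.google.gwt.user.client.ui.RootPanel;
import com.google.gwt.user.client.ui.TextBox;
import com.google.gwt.user.client.ui.VerticalPanel;

/**
 * Entry point classes define onModuleLoad().
 */
public class GWTWebSite implements EntryPoint {
  /**
   * The message displayed to the user when the server cannot be reached or
   * returns an error.
   */
  private static final String SERVER_ERROR = "An error occurred while "
      + "attempting to contact the server. Please check your network "
      + "connection and try again.";

  /**
   * Create a remote service proxy to talk to the server-side Greeting service.
   */
  private final GreetingServiceAsync greetingService = GWT
      .create(GreetingService.class);

  /**
   * This is the entry point method.
   */
  public void onModuleLoad() {
    final Button sendButton = new Button("Send");
    final TextBox nameField = new TextBox();
    nameField.setText("GWT User");
    final Label errorLabel = new Label();

    // We can add style names to widgets
    sendButton.addStyleName("sendButton");

    // Add the nameField and sendButton to the RootPanel
    // Use RootPanel.get() to get the entire body element
    RootPanel.get("nameFieldContainer").add(nameField);
    RootPanel.get("sendButtonContainer").add(sendButton);
    RootPanel.get("errorLabelContainer").add(errorLabel);

    // Focus the cursor on the name field when the app loads
    nameField.setFocus(true);
    nameField.selectAll();

    // Create the popup dialog box
    final DialogBox dialogBox = new DialogBox();
    dialogBox.setText("Remote Procedure Call");
    dialogBox.setAnimationEnabled(true);
    final Button closeButton = new Button("Close");
    // We can set the id of a widget by accessing its Element
    closeButton.getElement().setId("closeButton");
    final Label textToServerLabel = new Label();
    final HTML serverResponseLabel = new HTML();
    VerticalPanel dialogVPanel = new VerticalPanel();
    dialogVPanel.addStyleName("dialogVPanel");
    dialogVPanel.add(new HTML("<b>Sending name to the server:</b>"));
    dialogVPanel.add(textToServerLabel);
    dialogVPanel.add(new HTML("<br><b>Server replies:</b>"));
    dialogVPanel.add(serverResponseLabel);
    dialogVPanel.setHorizontalAlignment(VerticalPanel.ALIGN_RIGHT);
    dialogVPanel.add(closeButton);
    dialogBox.setWidget(dialogVPanel);

    // Add a handler to close the DialogBox
    closeButton.addClickHandler(new ClickHandler() {
      public void onClick(ClickEvent event) {
        dialogBox.hide();
        sendButton.setEnabled(true);
        sendButton.setFocus(true);
      }
    });

    // Create a handler for the sendButton and nameField
    class MyHandler implements ClickHandler, KeyUpHandler {
      /**
       * Fired when the user clicks on the sendButton.
       */
      public void onClick(ClickEvent event) {
        sendNameToServer();
      }

      /**
       * Fired when the user types in the nameField.
       */
      public void onKeyUp(KeyUpEvent event) {
        if (event.getNativeKeyCode() == KeyCodes.KEY_ENTER) {
          sendNameToServer();
        }
      }

      /**
       * Send the name from the nameField to the server and wait for a response.
       */
      private void sendNameToServer() {
        // First, we validate the input.
        errorLabel.setText("");
        String textToServer = nameField.getText();
        if (!FieldVerifier.isValidName(textToServer)) {
          errorLabel.setText("Please enter at least four characters");
          return;
        }

        // Then, we send the input to the server.
        sendButton.setEnabled(false);
        textToServerLabel.setText(textToServer);
        serverResponseLabel.setText("");
        greetingService.greetServer(textToServer,
            new AsyncCallback<String>() {
              public void onFailure(Throwable caught) {
                // Show the RPC error message to the user
                dialogBox
                    .setText("Remote Procedure Call - Failure");
                serverResponseLabel
                    .addStyleName("serverResponseLabelError");
                serverResponseLabel.setHTML(SERVER_ERROR);
                dialogBox.center();
                closeButton.setFocus(true);
              }

              public void onSuccess(String result) {
                dialogBox.setText("Remote Procedure Call");
                serverResponseLabel
                    .removeStyleName("serverResponseLabelError");
                serverResponseLabel.setHTML(result);
                dialogBox.center();
                closeButton.setFocus(true);
              }
            });
      }
    }

    // Add a handler to send the name to the server
    MyHandler handler = new MyHandler();
    sendButton.addClickHandler(handler);
    nameField.addKeyUpHandler(handler);
  }
}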

Essentially we create a class for the entry point of the website (i.e. home) and implement the onModuleLoad() method. The tutorial defines a number of widgets: a button called "Send" and a TextBox with a default value of "GWT User"; the text box and button are added to the root panel. A DialogBox is then created that will display the results. A close button is added to the DialogBox, with a click handler to close the DialogBox if a user clicks on the close button.
A nested class called MyHandler is constructed that implements both the ClickHandler and KeyUpHandler interfaces, and so has to implement the onClick() and onKeyUp() methods. If a user clicks the send button or presses Enter in the text box, the input is sent to the server via the sendNameToServer() method, which makes an asynchronous remote procedure call to the server.

5. Get Egit
You're probably now wondering how I integrated Eclipse with GitHub? Well, there is also a plugin to allow you to push your code to GitHub. The plugin is called Egit, and can be downloaded from: A full and more comprehensive tutorial on Egit can be found at

6. Mavenize and manage those dependencies
If you're new to Java, or you haven't heard of Maven, you should read this. Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information. I will talk about Maven, and mavenizing a GWT project, in another post.

GWT Simple Demo
The result of our simple GWT application / website demo is shown in the image below. On clicking the send button an RPC (remote procedure call) is made. The dialog shows "Sending name to the server" followed by your input text, then the server replies with "hello", the name of the user and details of the operating system and browser used (in this case Ubuntu 8.04 and Firefox).
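
The validate-then-call-back flow of sendNameToServer() is not specific to GWT, and can be sketched in a few lines of Python. This is a deliberately simplified, synchronous stand-in for the asynchronous RPC: send_name, is_valid_name and the fake server function are hypothetical names of my own, and the four character rule simply mirrors the error message in the listing.

```python
def is_valid_name(name):
    """Mirror the check behind the error message: at least four characters."""
    return name is not None and len(name) >= 4


def send_name(name, server, on_success, on_failure):
    """Validate the input, call the server, then hand off to one callback.

    Returns the error text to show next to the input field ('' if none).
    """
    if not is_valid_name(name):
        return "Please enter at least four characters"
    try:
        on_success(server(name))
    except Exception as caught:
        # Any failure while talking to the server goes to the failure callback.
        on_failure(caught)
    return ""


if __name__ == "__main__":
    def fake_server(name):
        # Stand-in for the remote greeting service.
        return "Hello, " + name + "!"

    def show(result):
        print("Server replies:", result)

    def fail(error):
        print("Remote Procedure Call - Failure:", error)

    print(send_name("Ed", fake_server, show, fail))
    send_name("GWT User", fake_server, show, fail)
```

In the real application the call returns immediately and the callback fires later, which is why the GWT code disables the send button while it waits.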

Tuesday, 6 July 2010

Name2Structure and Back Again, A New Adventure

A recent presentation I found online, titled "Chemical Information Mining: Possibilities and Pitfalls", attracted my attention. It was given by Neil Stutchbury on the 20th of May 2009 at the CICAG (Chemical Information and Computer Applications Group) Scientific Text and Data Mining conference in Burlington House, London. The presentation covered a number of key points: firstly extracting chemical entities from text, secondly converting chemical names to structures, thirdly a user survey and finally points for possible future extensions.

Since I have now accepted a position at OpenEye, to work primarily on Lexichem, I am particularly interested in his discussion on name to structure. You're probably wondering what comes under the umbrella of "name"? Well, there are a number of ever emerging types of names, too many to list in this blog post. However, the ones currently under discussion are: trade names, common names, natural products, abbreviations, molecular formulae and IUPAC names. A trade name, according to Wikipedia, is "the name which a business trades under for commercial purposes"; an example would be "Aspirin", which is dissimilar to its chemical name "acetylsalicylic acid". A common name is a name in general use in a community; caffeine is a common name for 1,3,7-trimethyl-1H-purine-2,6(3H,7H)-dione. Natural products are chemical compounds or substances produced by a living organism, such as Taxol. In chemistry, a number of abbreviations are used to shorten a word or phrase; examples include DMSO (dimethyl sulfoxide) and THF (tetrahydrofuran). People in everyday life often talk about ordering H2O at the bar after a heavy night out; H2O is the molecular formula for water.

IUPAC names are based on IUPAC nomenclature, a system of naming chemical compounds. It is developed and kept up to date by the International Union of Pure and Applied Chemistry (IUPAC). There are currently two books detailing rules for naming organic and inorganic compounds, which will be the subject of a future blog post.

When starting a new position, in any field, it is always interesting to know the competition. So who (both open and commercial) produces name to structure tools? Well, the (main) suppliers are shown in the table below:

University of Cambridge -> OSCAR
ChemAxon ->
InfoChem -> Annotator and ICN2S
ChemMantis -> SureChem (NER) with ACD/Name; ACD/NTS / Batch (N2S)
CambridgeSoft -> Name=Struct
OpenEye -> Lexichem TK
Accelrys -> ChemMining
TEMIS/MDL -> Chemical Entity Relationships Skill Cartridge
MPirics -> Chemical Content Recognition

Having first-hand experience, coming from the University of Cambridge and Peter Murray-Rust's group, I can recommend OPSIN (the N2S module), which is open source and written in Java. Whilst OPSIN is showing progress, it is still in its infancy.
ChemAxon's product is hosted online. They have used the jQuery JavaScript library to create a form into which a user can paste chemical names, or upload a small (4KB) file, to convert names to structures. It should also be noted that ChemAxon have their own Java API to convert names to structures and structures to names.

A simple program to read in molecules, then convert the molecules to IUPAC and traditional names is shown.

package eoc21;

import java.io.IOException;

import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.license.LicenseManager;
import chemaxon.license.LicenseProcessingException;
import chemaxon.marvin.calculations.IUPACNamingPlugin;
import chemaxon.marvin.plugin.PluginException;
import chemaxon.struc.Molecule;

/**
 * Converts structure to name using ChemAxon's java API.
 * @author ed
 */
public class Name2Structure {

    public static void main(String[] args) throws PluginException, MolFormatException, IOException, LicenseProcessingException {
        MolImporter mi = new MolImporter("150mols.smi");
        Molecule mol;
        IUPACNamingPlugin plugin = new IUPACNamingPlugin();
        // Read each molecule in turn and generate both names for it.
        while ((mol = mi.read()) != null) {
            // Assumed reconstruction: hand the molecule to the plugin and run it.
            plugin.setMolecule(mol);
            plugin.run();
            String preferredIUPACName = plugin.getPreferredIUPACName();
            String traditionalName = plugin.getTraditionalName();
            System.out.println(preferredIUPACName + "\t" + traditionalName);
        }
        mi.close();
    }
}

InfoChem's Annotator and ICN2S can extract chemically relevant entities from text and convert the names to structures. They support systematic names, trivial names and trade names, as well as InChIs and CAS Registry Numbers.

ChemMantis is an integrated system for chemistry-based entity extraction and document mark-up, enabling access to the rich resource of online chemistry known as ChemSpider.

CambridgeSoft's Name=Struct supports both name to structure and structure to name conversions. It offers support for charged compounds and salts, highly symmetric structures and many other types of inorganic and organometallic compounds.

Lexichem TK
A C++ toolkit provided by OpenEye that converts names to structures and structures to names, and can also convert names into different languages such as Japanese, Romanian and Hungarian. An example of name to structure code, taken from the Lexichem documentation, is given below:

#include "openeye.h"
#include "oeplatform.h"
#include "oesystem.h"
#include "oechem.h"
#include "oeiupac.h"
#include "nam2mol_example.itf"
using namespace OEPlatform;
using namespace OESystem;
using namespace OEChem;
using namespace OEIUPAC;
using namespace std;
#define STDIN_FILENO 0

// Note: calls on the lines that were truncated in the original listing
// (open, Clear, OESetSDData, SetTitle, OEWriteMolecule and the Get/Has
// template arguments) are a best-effort reconstruction.
int main(int argc, char *argv[])
{
  OEThrow.Info("Lexichem nam2mol example");
  OEThrow.Info(" Lexichem version: %s", OEIUPACGetRelease());
  OEInterface itf(InterfaceData, argc, argv);

  oeifstream infile;
  string inname=itf.Get<std::string>("-in");
  if (inname=="-")
  {
    if (!infile.openfd(STDIN_FILENO, true)) // read from stdin
      OEThrow.Fatal("Unable to read from stdin");
  }
  else
  {
    if (!infile.open(inname))
      OEThrow.Fatal("Unable to open input file: %s\n", inname.c_str());
  }

  oemolostream outfile;
  if (!outfile.open(itf.Get<std::string>("-out")))
    OEThrow.Fatal("Unable to create output file: %s\n",
                  itf.Get<std::string>("-out").c_str());

  unsigned int language = OEGetIUPACLanguage(itf.Get<std::string>("-language"));
  OEGraphMol mol;
  char buffer[8192];
  bool done;
  while (infile.getline(buffer,8192))
  {
    mol.Clear();
    // Speculatively reorder CAS permuted index names
    std::string str = OEReorderIndexName(buffer);
    if (str.empty()) str = buffer;
    if (language != OELanguage::AMERICAN)
    {
      str = OEFromUTF8(str.c_str());
      str = OELowerCaseName(str.c_str());
      str = OEFromLanguage(str.c_str(),language);
    }
    done = OEParseIUPACName(mol,str.c_str());
    if (!done && itf.Get<bool>("-empty"))
    {
      mol.Clear();
      done = true;
    }
    if (done)
    {
      if (itf.Has<std::string>("-tag"))
        OESetSDData(mol, itf.Get<std::string>("-tag"), buffer);
      mol.SetTitle(buffer);
      OEWriteMolecule(outfile, mol);
    }
  }
  return 0;
}

ChemMining has the advantage that it can be incorporated into Pipeline Pilot workflows. It can extract chemical entities from documents, including chemical names, IUPAC names, SMILES strings, and common and brand names.

Chemical Entity Relationships Skill Cartridge
Developed by January 2006, this cartridge identifies chemical compound names, chemical classes and molecular formulae in text documents and translates extracted information into the chemist’s language: the chemical structure. The software includes the ability to identify chemical terms and assign them to specific chemical concepts according to semantic categories (another area of interest, the semantic web and chemistry, which I will blog about later in the year).

Chemical Content Recognition
This is a service delivered by MPirics. It allows chemists to extract chemical entities from text, the main target being patents.