RSC Publishing


Along came ChemSpider...


28 July 2009

Antony Williams
Earlier this year the RSC acquired ChemSpider, a free online source of structure-based chemical information. Rachel Cooper and David Barden speak to Antony Williams, founder and VP of Strategic Development of ChemSpider, about crawling the web of chemistry and the benefits of joining with the RSC.


 

 

ChemSpider now has over 21 million entries - an immense achievement in just 2 years. What started it off?

There was a time when I was working extreme hours and my creative spirit was not being used to its full extent. I saw what was happening with online chemical databases, and thought how good it would be to create a chemical information resource that actually engaged the user in managing, qualifying and extending data. 

"About 5500 users visit the ChemSpider website every day, at present."
At the time, the company I worked for was so focused on its own deadlines that it had little scope to take on more work, especially something that would provide little financial return, even though it would provide high value to the chemistry community. In discussions with some of my friends who share my passion for cheminformatics, we decided to set up a platform to host chemical structures that enabled the deposition and curation of related data. Wikipedia started in this way and look at it now - highly regarded, still imperfect, but very much a living community of participation. What more could we want for ChemSpider?


Did you need a big grant to get it all off the ground? 

ChemSpider was set up running from a basement using minimal computer resources. There were no grants or other funding - a single server was bought from my own bank account and then a couple of computers were built from parts. We went live in March 2007 hosting only the PubChem dataset, as a proof of concept, and then I started discussions with people I knew to get their data deposited. Today we've got over 200 data sources, and all the while we've been building a platform for improved data deposition and curation.

"ChemSpider has approx. 100,000 validated structures"
We've also extended the range of data we can handle, so now we are not just managing structures and alphanumeric text, but also images, spectra and documents.


You mentioned the imperfections of data available on the web, but how common are errors in online chemical data?

Very common! Chemical data on the internet is, to coin a phrase, 'diverse and dirty'. With so many databases online, and with various levels of expertise and care being given to preparing the data, there are lots of problems. Some common issues are the mis-association of names or identifiers with structures, incorrect structures (bad valences, charge imbalance, incomplete or incorrect stereochemistries) and incorrect association of data with chemical entities. It is a complex situation, and something that we've been working hard to clear up. To be sure, ChemSpider has inherited many errors, but they are being removed on a daily basis - incorrect structures are being deprecated, structure-identifier relationships are being clarified, and badly associated data are being removed. 


What should we do if we see an error in a ChemSpider entry?

The first thing that you should do is to let us know - don't leave it there for others to experience. Simply click on the 'Comments' box and send us a note describing what you believe to be wrong. Members of the curation team will then send a response to you, generally very quickly. All comments are welcome, as it is the best way to ensure that the quality of the data in ChemSpider continues to improve.


Everyone involved obviously works hard to improve the quality of the data. What is the most difficult thing to fix?

Trying to determine the correct structure that should be associated with a trivial chemical name can be very challenging. Let's take a natural product as an example, Ginkgolide B. This has a complex structure that is difficult to draw unambiguously, even for the most skilled chemist. Before curation, on ChemSpider there were seven structures called Ginkgolide B. Some had no stereocentres marked, some had full but incorrect stereochemistry, and some even had a completely different skeleton! From this point, it was a matter of researching the correct structure by consulting experts and the primary literature. We also needed to be aware of the timeline, as accepted structures can change when they are re-examined by new techniques. The benefit of all this is that, following this research, we can now be sure of the structure of Ginkgolide B, and have a depiction of it that is unambiguous - to humans and machines!

Ginkgolide B - just one of the molecules that required some high-level curation by the ChemSpider team

 


Say I made a new molecule and wanted to deposit the data with ChemSpider - could I do that myself? 

Absolutely - just depositing a single structure is about a minute's work, and a bit longer if you want to add things such as DOIs, spectra and images. At the moment, though, it's easier for us to deal with depositions of a few thousand structures at once, as that makes qualifying the data easier.


Joining up with the RSC will presumably lead to changes for ChemSpider. What do you see as the main benefits? 

For ChemSpider the biggest benefit is that our hardware and software limitations will disappear, with backup systems in place to deal with power cuts, hard disk crashes and bandwidth constraints. Also, the ChemSpider team, instead of being volunteers working on the system in their spare time, will be employees, able to focus on improving the performance and capabilities of the system. One other enormous benefit is that the ChemSpider team brings well over 40 years of experience of cheminformatics into the RSC, and we hope to use this in integrating the software we've developed into RSC Prospect, leading to substantial enhancements in semantic enrichment. However, the ultimate beneficiaries will of course be the users, who will see improved performance and quality as more time is devoted to making ChemSpider a world-class resource.  


Who makes up the ChemSpider Team?

There are currently three members of the core ChemSpider team at the RSC: myself, Valery Tkachenko and Sergey Shevelev. We all have experience in cheminformatics research and the development of commercial products, and deep expertise in structure handling, nomenclature, spectroscopy and algorithm-based predictions. Collectively, we have carried out research and developed commercial products for thousands of customers, and also have hands-on experience in managing large-scale projects. A key part of our expertise is in being able to engage the user-base to help design the optimal solution for their needs. We've also been involved in curating some of the chemical information on Wikipedia, such as making chemboxes consistent - an ongoing task!

Left to right: Sergey, Antony and Valery - the three core members of the ChemSpider team.


And finally, what is the future for ChemSpider?

The vision for the future of ChemSpider is quite simple. When chemists are looking to search the internet for structure-based chemistry information, I want them to think of ChemSpider as their primary search engine.

"I want chemists to think of ChemSpider as their primary search engine for chemical information"
Whether it be to find suppliers, research articles, information on chemical properties, spectral or synthesis data, or the latest chemistry news, I would like them to come to ChemSpider first.
