  • Oral presentation
  • Open access

Is learning drugs the same as learning non-drugs?

In their recent paper [1], Good and Hermsmeier discuss the effects of test set selection on the evaluation and comparison of SAR methodologies. In particular, they examine how the analogue effect leads to overestimating the predictiveness of drug vs non-drug models built by Bayesian modeling. The analogue effect arises because most drug and drug-like compendia contain extensive sets of analogues. Selecting test sets at random therefore tends to place members of the same series in both the test and training sets, which in turn makes the apparent predictivity of a model greater than it would otherwise be.
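As a rough illustration of the analogue effect, the following Python sketch measures how often a randomly chosen test compound has an analogue from the same series in the training set. The function name `series_overlap_fraction`, the series labels, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
import random

def series_overlap_fraction(series_labels, test_fraction=0.2, seed=0):
    """Fraction of test compounds whose analogue series also occurs in training.

    series_labels[i] is the (hypothetical) series label of compound i.
    """
    rng = random.Random(seed)
    indices = list(range(len(series_labels)))
    rng.shuffle(indices)  # random test set selection
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_series = {series_labels[i] for i in train_idx}
    shared = sum(1 for i in test_idx if series_labels[i] in train_series)
    return shared / len(test_idx)

# Toy data: 10 hypothetical series of 20 analogues each.
labels = [f"series_{k}" for k in range(10) for _ in range(20)]
fraction = series_overlap_fraction(labels)
print(f"test compounds with a training-set analogue: {fraction:.0%}")
```

When every series is large, almost every test compound has close relatives in the training set, which is the situation in which a random split is expected to flatter a model.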

Good and Hermsmeier proposed a protocol for evaluating models that reduces this effect. They first organized a drug database into classes based on the drug ontology classes defined by Schuffenhauer et al. [2]. They then learned models from training sets that excluded a particular class of drug and tested predictivity on test sets drawn from that class. The authors focused on the ability of various methods to minimize type II errors (false negatives), that is, the prediction of drugs as non-drugs.
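The leave-class-out idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `leave_class_out_splits`, the compound identifiers, and the class names are placeholders, and the handling of non-drugs (which the abstract does not describe) is left out.

```python
def leave_class_out_splits(class_of):
    """Yield (held_out_class, train_ids, test_ids), one split per drug class.

    class_of maps a compound identifier to its drug class label.
    """
    for held_out in sorted(set(class_of.values())):
        # Train on every class except the held-out one.
        train_ids = [cid for cid, cls in class_of.items() if cls != held_out]
        # Test only on members of the held-out class.
        test_ids = [cid for cid, cls in class_of.items() if cls == held_out]
        yield held_out, train_ids, test_ids

# Toy data: identifiers and class names are placeholders, not real ontology entries.
toy_classes = {
    "cpd_1": "class_A", "cpd_2": "class_A",
    "cpd_3": "class_B", "cpd_4": "class_B",
    "cpd_5": "class_C",
}
splits = list(leave_class_out_splits(toy_classes))
for held_out, train_ids, test_ids in splits:
    print(held_out, "train:", train_ids, "test:", test_ids)
```

Because no member of the held-out class is seen during training, a model cannot score well on the test set simply by recognizing analogues of training compounds from that class.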

After reproducing their work as far as we were able, we extended their study to also consider the effects on type I errors (false positives), that is, the prediction of a non-drug as a drug. This is an important consideration when assessing the practicality of these methods for selecting sets of samples for synthesis or for purchase and screening.
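To make the two error types concrete, the sketch below computes both rates from true and predicted labels. It assumes a binary encoding (1 = drug, 0 = non-drug) and a hypothetical helper name `error_rates`; the labels are toy values, not results from the study.

```python
def error_rates(y_true, y_pred):
    """Return (type_i_rate, type_ii_rate) for labels with 1 = drug, 0 = non-drug.

    Type I rate: fraction of non-drugs predicted as drugs (false positives).
    Type II rate: fraction of drugs predicted as non-drugs (false negatives).
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n_neg = sum(1 for t in y_true if t == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    # A rate is undefined (NaN) when the test set has no members of that label.
    type_i = fp / n_neg if n_neg else float("nan")
    type_ii = fn / n_pos if n_pos else float("nan")
    return type_i, type_ii

# Toy labels: four drugs followed by four non-drugs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
type_i, type_ii = error_rates(y_true, y_pred)
print(type_i, type_ii)  # 0.5 0.25
```

A test set made up only of drugs yields a type II rate but no type I rate, so estimating false positives requires non-drug compounds in the test set as well.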

In reproducing the original work, we largely concur with the authors that descriptors encoding small and more abstract features of molecules are the most effective at minimizing type II errors. For minimizing type I errors, however, we found these types of descriptors to be less effective; instead, descriptors that encompassed much larger fragments produced the most predictive models. In other words, the descriptors required for learning non-drugs are fundamentally different from those required for learning drugs. We therefore propose that an experiment can be tailored to meet its requirements for precision vs recall by adjusting the environment size that is encoded by the descriptors.
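The link between the two error types and the precision vs recall trade-off can be sketched as follows, assuming "drug" is treated as the positive class. The helper name `precision_recall` and the counts are illustrative only and are not taken from the study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the 'drug' class from confusion counts.

    tp: drugs predicted as drugs
    fp: non-drugs predicted as drugs (type I errors)
    fn: drugs predicted as non-drugs (type II errors)
    """
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# Toy counts: 3 true positives, 2 false positives, 1 false negative.
precision, recall = precision_recall(tp=3, fp=2, fn=1)
print(precision, recall)  # 0.6 0.75
```

Fewer type I errors raise precision, and fewer type II errors raise recall, so a descriptor choice that favors one error type shifts the balance between the two measures.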


  1. Good AC, Hermsmeier MA: J Chem Inf Model. 2007, 47: 110-114. 10.1021/ci6003493.

  2. Schuffenhauer A, et al: J Chem Inf Comput Sci. 2002, 42: 947-955. 10.1021/ci010385k.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Brown, R.D., Rogers, D. Is learning drugs the same as learning non-drugs? Chemistry Central Journal 2 (Suppl 1), S5 (2008).
