  • Poster presentation
  • Open access

Exploring benchmark dataset bias in ligand based virtual screening

A common finding of many reports evaluating virtual screening (VS) methods is that validation results vary considerably with the dataset used, i.e. with the region of chemical space covered by the active ligands. These dataset-specific effects are generally assumed to be caused by the self-similarity and cluster structure inherent to the datasets.

As a first step, an experimental setup was developed that isolates dataset composition as the sole source of variance in VS performance. The Hert-Willett benchmark datasets have been widely used for the validation of ligand-based VS protocols. Various sampling strategies (D-optimum design, onion design, minimum distance design) were employed to generate archetypal subsamples from these datasets: (1) maximum diversity subsets, (2) space-filling samples and (3) subsets with minimum intra-set diversity. The analysis of VS performance across these prototype datasets showed that dataset composition does indeed exert a critical influence on VS validation. It identified the local clustering of a dataset and its global spread relative to the set of decoys as the factors with the highest impact on VS performance.
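To make the idea of archetypal subsamples concrete, the following is a minimal sketch that draws a high-diversity and a low-diversity subset from the same toy dataset. It is not the authors' procedure: greedy MaxMin picking stands in for the design-based samplers named above, fingerprints are represented as sets of on-bits, and the functions `toy_fingerprints`, `maxmin_subset` and `compact_subset` are hypothetical names introduced only for this illustration.

```python
import random


def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def toy_fingerprints(n_per_cluster=10, seed=0):
    """Build three artificial 'chemotype' clusters (an assumption, not real data)."""
    rng = random.Random(seed)
    fps = []
    for core in (range(0, 20), range(100, 120), range(200, 220)):
        for _ in range(n_per_cluster):
            # 12 bits from the shared core plus one compound-specific bit
            fps.append(set(rng.sample(list(core), 12)) | {rng.randrange(1000, 2000)})
    return fps


def maxmin_subset(fps, k, seed_index=0):
    """Greedy MaxMin: repeatedly add the compound farthest from all current picks."""
    picked = [seed_index]
    # min_dist[i] is the distance from compound i to its nearest picked compound
    min_dist = [tanimoto_distance(fp, fps[seed_index]) for fp in fps]
    while len(picked) < min(k, len(fps)):
        candidates = [i for i in range(len(fps)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
    return picked


def compact_subset(fps, k, seed_index=0):
    """Low-diversity subset: the seed compound and its k - 1 nearest neighbours."""
    order = sorted(range(len(fps)),
                   key=lambda i: tanimoto_distance(fps[i], fps[seed_index]))
    return order[:k]


def mean_pairwise_distance(fps, idx):
    """Average Tanimoto distance over all pairs in the subset."""
    pairs = [(i, j) for n, i in enumerate(idx) for j in idx[n + 1:]]
    return sum(tanimoto_distance(fps[i], fps[j]) for i, j in pairs) / len(pairs)


fps = toy_fingerprints()
diverse = maxmin_subset(fps, 6)
compact = compact_subset(fps, 6)
print("diverse subset:", diverse, mean_pairwise_distance(fps, diverse))
print("compact subset:", compact, mean_pairwise_distance(fps, compact))
```

On this toy data the diverse subset draws from all three clusters, whereas the compact subset stays within the cluster of the seed compound. Running the same VS method on both subsets would then vary only the dataset composition, which is the kind of controlled comparison the setup above describes.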

Because these datasets are distributions of points in chemical space, the field of spatial statistics, which offers a wealth of methods for analysing clustering, patchiness and dispersion, is a natural source of tools. By employing these methods, we were able to analyse the spatial composition of the benchmark datasets in more detail and to derive several rules of thumb for choosing unbiased datasets for the evaluation of ligand-based VS methods.
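The abstract does not name the specific spatial statistics used, so the following is only a sketch of one simple possibility: a nearest-neighbour summary that compares, for every active, the distance to its nearest fellow active with the distance to its nearest decoy. The 2D coordinates are a stand-in for a chemical space embedding, and the function `nn_summary` and all toy data are assumptions made for illustration.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from point to its closest neighbour in others."""
    return min(math.dist(point, o) for o in others)


def nn_summary(actives, decoys):
    """Summarise local clustering of actives relative to the decoy background."""
    active_active, active_decoy = [], []
    for i, a in enumerate(actives):
        rest = actives[:i] + actives[i + 1:]  # exclude the active itself
        active_active.append(nearest_distance(a, rest))
        active_decoy.append(nearest_distance(a, decoys))
    n = len(actives)
    return {
        "mean_active_active": sum(active_active) / n,
        "mean_active_decoy": sum(active_decoy) / n,
        # share of actives whose nearest neighbour is another active
        "fraction_nearest_is_active": sum(
            aa < ad for aa, ad in zip(active_active, active_decoy)) / n,
    }


# Toy decoys: a regular 10 x 10 grid covering the unit square.
decoys = [(i / 10 + 0.05, j / 10 + 0.05) for i in range(10) for j in range(10)]

# Toy actives, variant 1: one tight cluster between decoy grid points.
clustered = [(0.5 + 0.002 * k, 0.5) for k in range(10)]

# Toy actives, variant 2: spread evenly over the same space.
spread = [(0.1 + 0.2 * i, 0.1 + 0.2 * j) for i in range(5) for j in range(5)]

clustered_stats = nn_summary(clustered, decoys)
spread_stats = nn_summary(spread, decoys)
print("clustered:", clustered_stats)
print("spread:   ", spread_stats)
```

In this constructed example every clustered active has another active as its nearest neighbour, while every spread active is closest to a decoy. A summary of this kind gives a rough, quantitative handle on the local clustering and global spread discussed above.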

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Baumann, K., Rohrer, S. Exploring benchmark dataset bias in ligand based virtual screening. Chemistry Central Journal 2 (Suppl 1), P1 (2008).
