- Poster presentation
- Open Access
Complexity effects in fingerprint similarity searching
Chemistry Central Journal volume 3, Article number: P5 (2009)
Similarity searching using fingerprint representations of molecules is widely applied to mine chemical databases. Known active compounds serve as templates in the search for novel hits, with similarity measures providing a quantitative comparison of bit strings. A variety of similarity metrics are used for this purpose, including the popular Tanimoto coefficient and the Tversky coefficients.
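As a minimal sketch of the two metrics named above, the following Python code computes the Tanimoto and Tversky coefficients in their standard textbook forms. Representing a fingerprint as a set of on-bit positions, the function names, and the toy fingerprints are assumptions made for illustration and are not taken from the poster.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def tversky(a, b, alpha, beta):
    """Tversky coefficient with weights alpha and beta on the bits that are
    set only in a and only in b, respectively."""
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denominator = common + alpha * only_a + beta * only_b
    return common / denominator if denominator else 0.0


# Toy fingerprints given as sets of on-bit positions (hypothetical data).
reference = {1, 4, 7, 9}
candidate = {1, 4, 8}

print(tanimoto(reference, candidate))           # 2 shared / 5 in union
print(tversky(reference, candidate, 1.0, 1.0))  # equals Tanimoto for alpha = beta = 1
print(tversky(reference, candidate, 1.0, 0.0))  # only reference-specific bits penalized
```

With alpha = beta = 1 the Tversky coefficient reduces to the Tanimoto coefficient; unequal weights make the measure depend on which molecule is treated as the reference.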
Differences in molecular complexity and size are known to bias the evaluation of fingerprint similarity. Complex molecules tend to produce fingerprints with higher bit density than simpler ones, which often leads to artificially high similarity values in search calculations. For example, we have analyzed similarity value distributions and demonstrated that the apparent asymmetry in Tversky similarity search calculations is a direct consequence of differences in fingerprint bit densities.
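The link between bit density and Tversky asymmetry can be illustrated with a small toy calculation. The sketch below is not the analysis from the cited work; the fingerprint length, the two example fingerprints, and the helper names are hypothetical, and the Tversky weights are chosen so that only bits unique to the first argument are penalized.

```python
def bit_density(on_bits, n_bits):
    """Fraction of bit positions that are set on."""
    return len(on_bits) / n_bits


def tversky(a, b, alpha, beta):
    """Tversky coefficient on sets of on-bit positions."""
    common = len(a & b)
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    return common / denominator if denominator else 0.0


N_BITS = 64                          # assumed fingerprint length
simple_fp = set(range(0, 8))         # low-complexity molecule: 8 bits on
complex_fp = set(range(0, 32))       # high-complexity molecule: 32 bits on

# alpha = 1, beta = 0: only bits unique to the first argument lower the score.
simple_as_reference = tversky(simple_fp, complex_fp, 1.0, 0.0)
complex_as_reference = tversky(complex_fp, simple_fp, 1.0, 0.0)

print(bit_density(simple_fp, N_BITS), bit_density(complex_fp, N_BITS))
print(simple_as_reference)   # 1.0: every on-bit of the sparse fingerprint is matched
print(complex_as_reference)  # 0.25: most on-bits of the dense fingerprint are unmatched
```

In this toy case the same pair of molecules scores 1.0 or 0.25 depending only on which fingerprint is the reference, purely because the two fingerprints differ in bit density.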
In principle, there are two approaches to balancing complexity effects: designing fingerprints that have constant bit density regardless of the nature of the test molecules, or introducing similarity metrics that weight bit positions set on and bit positions set off equally. We have shown that a size-independent fingerprint with constant bit density does not produce asymmetrical search results. In addition, a novel similarity metric has been developed that not only balances complexity effects but also further improves search performance compared to conventional Tanimoto similarity calculations. However, highly complex molecules are generally much less suitable as reference compounds for fingerprint searching than active compounds whose complexity is comparable to that of the screening database. Random deletion of bits that are set on in complex templates has been shown to increase compound recall, despite the associated loss of chemical information content. Taking the relative chemical complexity of reference and database compounds into account thus makes it possible to increase the success rates of fingerprint similarity searching.
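Two of the ideas above, weighting on and off bits equally and randomly deleting on-bits from a complex template, can be sketched as follows. This is an illustration under stated assumptions and not the metric or protocol of the cited studies: `balanced_tanimoto` simply averages the Tanimoto coefficient computed on on-bits with the one computed on off-bits, and `thin_template` is a hypothetical helper that keeps a random subset of on-bits of a chosen size.

```python
import random


def tanimoto(a, b):
    """Tanimoto coefficient on sets of bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def balanced_tanimoto(a, b, n_bits):
    """Illustrative on/off-balanced score (an assumption, not the published
    metric): mean of Tanimoto on on-bits and Tanimoto on off-bits."""
    universe = set(range(n_bits))
    on_score = tanimoto(a, b)
    off_score = tanimoto(universe - a, universe - b)
    return 0.5 * (on_score + off_score)


def thin_template(on_bits, target_count, seed=0):
    """Randomly keep target_count on-bits of a template fingerprint.
    The seed makes the random deletion reproducible."""
    if target_count >= len(on_bits):
        return set(on_bits)
    rng = random.Random(seed)
    return set(rng.sample(sorted(on_bits), target_count))


# Hypothetical 8-bit example for the balanced score.
a = {0, 1}
b = {0, 1, 2, 3}
print(balanced_tanimoto(a, b, n_bits=8))

# Hypothetical dense template reduced to the on-bit count of a sparser database.
template = set(range(32))
thinned = thin_template(template, target_count=8)
print(len(thinned), thinned <= template)
```

How many bits to delete is left as a free parameter here; matching the typical on-bit count of the database compounds is one plausible choice, consistent with the observation that references of comparable complexity perform better.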
Willett P, et al: J Chem Inf Comput Sci. 1998, 38 (6): 983-96.
Chen X, Brown F: Chem Med Chem. 2007, 2 (2): 180-2.
Flower D: J Chem Inf Comput Sci. 1998, 38 (3): 379-86.
Wang Y, et al: Chem Med Chem. 2007, 2 (7): 1037-42.
Wang Y, Bajorath J: J Chem Inf Model. 2008, 48 (1): 75-84. 10.1021/ci700314x.
Wang Y, et al: Chem Biol Drug Design. 2008, 71 (6): 511-7. 10.1111/j.1747-0285.2008.00664.x.