Skip to main content
  • Oral presentation
  • Open access
  • Published:

Graph representation of molecular datasets: applications to dataset visualization and comparison using graph indices

The visualization of large chemical datasets in multidimensional chemistry spaces is a great challenge in cheminformatics. To this end, we have developed a novel approach using graph representations for ensembles of molecules-points defined by their coordinates in high dimensional descriptor space. The Advanced Dataset Graph (ADG) consists of an ensemble of vertices (corresponding to molecules) connected by edges; two vertices are connected by an edge if the Euclidean distance between the vertex-molecules in the original descriptor space is within a user-defined cut-off. The ADDAGRA program has been implemented to build and visualize ADGs in three-dimensional graphic user interface. The uniqueness of this data representation is in that the points are projected onto 3D space using conventional Principal Component Analysis; however the projected vertices are connected by edges only if they are defined as neighbours in the original high dimensional space. Thus, unlike all other data projection approaches the ADG visualizes compound clusters in the original descriptor space exactly. In addition to the visualization, we have also implemented several simple graph indices for quantitative description and comparisons of ADGs. Three case studies involving (i) 101 AmpC beta-lactamase inhibitors, (ii) 2029 organic compounds with their measured intrinsic aqueous solubility, and (iii) 1093 chemical toxicants (see Figure 1) tested against Tetrahymena Pyriformis, have been analyzed with our ADG approach. Results suggest that some graph indices such as the average vertex degree or Randic connectivity index have the ability to discriminate similar vs. dissimilar pairs of datasets and address several other common issues in cheminformatics such as detection of outliers, shared regions in chemical and property space, etc. We suggest that ADGs could find a broad use for many cheminformatics applications.

Figure 1
figure 1

Advanced Dataset Graphs built for the "aquatic toxicity datasets, including the training set TR (red – 644 compounds), and two external sets TS, involving 339 (black) and 110 (blue) compounds respectively.

Author information

Authors and Affiliations


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Tropsha, A., Fourches, D. Graph representation of molecular datasets: applications to dataset visualization and comparison using graph indices. Chemistry Central Journal 3 (Suppl 1), O9 (2009).

Download citation

  • Published:

  • DOI: