
Exploring the role of topological descriptors to predict physicochemical properties of anti-HIV drugs by using supervised machine learning algorithms

Abstract

To explore the role of topological indices in predicting the physicochemical properties of anti-HIV drugs, this research uses Python-based algorithms to compute topological indices together with machine learning algorithms. Degree-based topological indices are calculated using a Python algorithm, providing important information about the structural behavior of drugs that is essential to their anti-HIV effectiveness. Machine learning algorithms then analyze the physicochemical properties that correspond to anti-HIV activity, making use of their ability to identify complex trends in large, convoluted datasets. In addition to improving our comprehension of the links between molecular structure and effectiveness, the combination of machine learning and QSPR research highlights the potential of computational approaches in drug discovery. This work demonstrates the advantages of machine learning in assessing drug properties and clarifies the mechanisms underlying anti-HIV effectiveness, which paves the way for the design and development of more potent anti-HIV drugs.


Introduction

Human Immunodeficiency Virus (HIV) was first identified in the early 1980s following the appearance of an immune system-damaging disease [1]. The illness was later named Acquired Immunodeficiency Syndrome (AIDS). In 1983–1984, French scientists Francoise Barre-Sinoussi and Luc Montagnier played an essential role in discovering the virus. HIV caused a global pandemic that has killed countless people and infected millions worldwide. Its impact on global health is immense, as it not only threatens human health but also affects economies and healthcare systems around the globe [2]. There are two primary types of HIV: HIV-1, which is common around the world, and HIV-2, which is primarily linked to West Africa. Here we focus on HIV-1. HIV-1 targets CD4 cells by binding to their surface receptors, which starts the process of the virus entering the cell and taking control of its functions, as shown in Fig. 1. New viruses are created as a result, and the CD4 cells are ultimately destroyed. This gradual depletion of CD4 cells severely weakens the immune system and reduces its capacity to mount effective defenses against infections [1, 3,4,5]. Breast milk, vaginal fluids, rectal fluids, semen and blood are among the bodily fluids that may transmit the virus. These fluids can spread HIV through risky sexual behavior, through needle sharing among people who inject drugs, or from mother to child during pregnancy, birth, or nursing [6]. The goal of antiviral therapy is to stop HIV-1 replication in order to protect CD4 cell levels and immune system health [7]. A wide variety of drugs, including Rilpivirine, Nevirapine, Emtricitabine, Delavirdine, Elvitegravir, Ritonavir, Saquinavir, Indinavir, and Bictegravir (referred to as a, b, c, …, i respectively, as shown in Fig. 2, with their molecular graphs represented in Fig. 3), are used to treat HIV-1 infection. They stop the virus from replicating and from spreading throughout the body by a number of distinct mechanisms. In doing so, these drugs help to regulate HIV levels in the blood, which protects CD4 cells. In HIV-1 analysis, graph theory provides a fundamental mathematical tool, particularly in chemistry and drug development. Some embeddings of drugs and diseases through dual-channel networks are characterized in [8,9,10,11]. The links between herbal medicines, chemical ingredients, target proteins, and associated diseases with respect to neural network and deep learning-based invariants are discussed in [12,13,14,15,16,17].

Fig. 1 Virus entering and functioning

Fig. 2 Molecular structure of antiviral HIV-1 drugs

Fig. 3 Molecular graph of emtricitabine, nevirapine and elvitegravir

Graph theory is essential to the analysis of biochemical networks in medicine, including drug-target relationships and protein–protein interactions [18,19,20,21,22]. To aid in the identification of possible drug candidates and the optimization of drug design, graphs depict pharmaceuticals as nodes and their interactions with targets as edges. Similarly, in graphs of protein–protein interactions, proteins are the nodes and their interactions are the edges. This makes it possible to identify important protein hubs and pathways that are connected to disease causes and potential treatment approaches. Topological indices (TIs) from graph theory are essential for drug discovery [23,24,25].

Our main goal is to conduct an extensive review of nine selected antiviral drugs for HIV-1. We compute their degree-based TIs, such as the Randic, Sum Connectivity, First Zagreb and Second Zagreb indices shown in Table 1, using a Python algorithm based on graph theory. Python programs are essential resources for researchers examining the chemical properties of drugs and computing topological indices. In addition to improving analytical efficiency by automating repetitive processes and quickly processing large data sets, the computational approach offers substantial benefits when many drugs are studied simultaneously. By revealing complex links between molecular descriptors and biological activities, the integration of physicochemical characteristics such as molecular weight (MW), complexity (Comp), density (Den), flash point (FP), molar volume (MV), surface tension (ST), polarizability (Pol), boiling point (BP) and enthalpy of vaporization (EV) into the study through machine learning algorithms contributes to our understanding of the potential efficacy and safety profiles of drugs against HIV. Combining topological indices with physicochemical parameters is essential to provide a comprehensive understanding of the molecular properties of HIV drugs, as well as insights into their modes of action and potential adverse effects. To predict drug efficacy from molecular features, researchers use supervised machine learning models to establish quantitative correlations between calculated molecular descriptors and observed biological activity. Such predictive models offer valuable insights into the potential efficacy of anti-HIV drugs by analyzing their molecular properties and estimating their effectiveness against the illness. Quantitative Structure–Property Relationship (QSPR) analysis is becoming increasingly important in understanding the relationships between drug structures and biological behavior [26,27,28,29,30]. QSPR analysis provides a rational framework for drug design and optimization [31,32,33]. By combining computational methods and QSPR analysis, researchers hope to obtain a deeper understanding of the molecular mechanisms underlying anti-HIV drugs, which will help in the development of more focused and efficient treatment options.

Table 1 Topological indices with notations and formula

Material and method

We first determined the edge partition of each molecular graph based on graph connectivity, which is an important step in recognizing structural properties. Degree-based TIs were then calculated from the degrees of the end vertices of each edge. To make this process easier, a dedicated Python algorithm was developed. After that, Python programs were used to implement machine learning methods for the analysis of physicochemical properties. Furthermore, we used the Statistical Package for the Social Sciences (SPSS) software to analyze relationships between the computed indices and experimental properties, and performed a graphical comparison between actual and computed drug properties to check the accuracy and credibility of our results.
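The edge-partition step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function name `edge_partition` and the toy edge list are assumptions introduced here, and the toy graph is not one of the nine drugs.

```python
from collections import Counter


def edge_partition(edges):
    """Return {(r, s): count}, where r <= s are the degrees of an edge's end vertices."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    partition = Counter()
    for u, v in edges:
        r, s = sorted((degree[u], degree[v]))
        partition[(r, s)] += 1
    return dict(partition)


# Hypothetical toy graph (carbon skeleton of 2-methylbutane), used only for illustration.
toy_edges = [("C1", "C2"), ("C2", "C3"), ("C3", "C4"), ("C2", "C5")]
toy_partition = edge_partition(toy_edges)
print(toy_partition)
```

Each key of the returned dictionary plays the role of a class of edges whose end vertices have degrees r and s, and each value is the size of that class.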

Data acquisition and preparation

  • We used Python 3.12 to compute topological indices and sourced physicochemical properties from the online databases ChemSpider (https://www.chemspider.com) and PubChem (https://pubchem.ncbi.nlm.nih.gov). The topological descriptors were employed as feature variables (input variables), while the physicochemical properties served as target variables. Our dataset comprised twelve topological indices as feature variables and nine physicochemical properties as target variables for the nine selected drugs.

  • Given that our dataset is labeled, we opted for supervised machine learning algorithms, specifically Random Forest (RF) and XGBoost, to analyze the data and derive insights. RF was chosen for its ability to limit overfitting through its ensemble approach: by averaging several decision trees, it gives a more stable and accurate prediction than a single decision tree, which is prone to overfitting. XGBoost is based on the gradient boosting framework, which builds one tree at a time; each new tree helps to correct errors made by the previously trained trees.

  • The primary libraries utilized for Random Forest and XGBoost are:

    • “pandas” for data manipulation,

    • “numpy” for numerical operations,

    • “scikit-learn” for machine learning algorithms such as Random Forest (XGBoost itself is distributed as the separate “xgboost” package),

    • “matplotlib” and “seaborn” for data visualization,

  • Computational resources: the computations were performed on a machine with an Intel Core i7 processor and 16 GB of RAM.

Results and discussion

Theorem 1

Let G1 denote the molecular graph of elvitegravir. Then the following hold for G1:

(a) M1 (G1) = 162; (b) M2 (G1) = 195; (c) H (G1) = 13.966; (d) F (G1) = 432; (e) SS (G1) = 35.088; (f) ABC (G1) = 23.695; (g) RI (G1) = 14.688; (h) SC (G1) = 15.1037; (i) GA (G1) = 31.705; (j) HZ (G1) = 822; (k) ReZ1 (G1) = 37.983; (l) ReZ2 (G1) = 1028.

Proof

Let G1 denote the molecular graph of elvitegravir, and let Er,s be the set of edges whose end vertices have degrees r and s, so that |Er,s| is the number of such edges. For example, |E1,2| = 2 means that two edges join vertices of degree 1 and 2, while |E1,3| = 7 means that seven edges join vertices of degree 1 and 3. Similarly, |E2,2| = 2, |E2,3| = 12 and |E3,3| = 10. Then,

  1. a)

    By using First Zagreb Index

    $$M_{{1}} \;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \left( {dr + ds} \right),$$
    $$\begin{aligned} M_{{1}} \;\left( {G_{{1}} } \right) & = {2}\left( {{1} + {2}} \right) + {7}\left( {{1} + {3}} \right) + {2}\left( {{2} + {2}} \right) + {12}\left( {{2} + {3}} \right) + {1}0\left( {{3} + {3}} \right) \\ & = {2} \times {3} + {7} \times {4} + {2} \times {4} + {12} \times {5} + {1}0 \times {6} = {162}{\text{.}} \\ \end{aligned}$$
  2. b)

    By using Second Zagreb Index

    $$M_{{2}} \;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \left( {dr \times ds} \right),$$
    $$\begin{aligned} M_{{2}} \;\left( {G_{{1}} } \right) & = {2}({1} \times {2}) + {7}\left( {{1} \times {3}} \right) + {2}\left( {{2} \times {2}} \right) + {12}\left( {{2} \times {3}} \right) + {1}0\left( {{3} \times {3}} \right) \\ & = {2} \times {2} + {7} \times {3} + {2} \times {4} + {12} \times {6} + {1}0 \times {9} = {195}{\text{.}} \\ \end{aligned}$$
  3. c)

    By using Harmonic Index

    $$H\;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \frac{2}{{\left( {dr + ds} \right)}},$$
    $$\begin{aligned} H\;\left( {G_{1} } \right) & = 2 \times \frac{2}{1 + 2} + 7 \times \frac{2}{1 + 3} + 2 \times \frac{2}{2 + 2} + 12 \times \frac{2}{2 + 3} + 10 \times \frac{2}{3 + 3} \\ & = 2 \times \frac{2}{3} + 7 \times \frac{2}{4} + 2 \times \frac{2}{4} + 12 \times \frac{2}{5} + 10 \times \frac{2}{6} = 13.966. \\ \end{aligned}$$
  4. d)

    By using Forgotten Index

    $$F\;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \left[ {\left( {dr} \right)^{2} + \left( {ds} \right)^{2} } \right],$$
    $$\begin{aligned} F\;\left( {G_{1} } \right) & = 2\left( {1^{2} + 2^{2} } \right) + 7\left( {1^{2} + 3^{2} } \right) + 2\left( {2^{2} + 2^{2} } \right) + 12\left( {2^{2} + 3^{2} } \right) + 10\left( {3^{2} + 3^{2} } \right) \\ & = 2 \times 5 + 7 \times 10 + 2 \times 8 + 12 \times 13 + 10 \times 18 = 432. \\ \end{aligned}$$
  5. e)

    By using Shilpa-Shanmukha Index

    $${\text{SS}}\;\left( {{\text{G1}}} \right) = \mathop \sum \limits_{{{\text{rs}} \in {\text{E}}\left( {\text{G}} \right)}} \sqrt {\frac{{{\text{dr}} \times {\text{ds}}}}{{{\text{dr}} + {\text{ds}}}}} ,$$
    $$\begin{aligned} {\text{SS}}\;\left( {{\text{G}}_{1} } \right) & = 2\sqrt {\frac{1 \times 2}{1 + 2}} + 7\sqrt {\frac{1 \times 3}{1 + 3}} + 2\sqrt {\frac{2 \times 2}{2 + 2}} + 12\sqrt {\frac{2 \times 3}{2 + 3}} + 10\sqrt {\frac{3 \times 3}{3 + 3}} \\ & = 2\sqrt {\frac{2}{3}} + 7\sqrt {\frac{3}{4}} + 2\sqrt {\frac{4}{4}} + 12\sqrt {\frac{6}{5}} + 10\sqrt {\frac{9}{6}} = 35.088. \\ \end{aligned}$$
  6. g)

    By using Randic Index

    $$RI\;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \sqrt {\frac{1}{dr \times ds}} ,$$
    $$\begin{aligned} RI\;\left( {G_{1} } \right) & = 2\sqrt {\frac{1}{1 \times 2}} + 7\sqrt {\frac{1}{1 \times 3}} + 2\sqrt {\frac{1}{2 \times 2}} + 12\sqrt {\frac{1}{2 \times 3}} + 10\sqrt {\frac{1}{3 \times 3}} \\ & = 2\sqrt {\frac{1}{2}} + 7\sqrt {\frac{1}{3}} + 2\sqrt {\frac{1}{4}} + 12\sqrt {\frac{1}{6}} + 10\sqrt {\frac{1}{9}} = 14.688. \\ \end{aligned}$$
  7. h)

    By using Sum Connectivity Index

    $$SC\;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \sqrt {\frac{1}{dr + ds}} ,$$
    $$\begin{aligned} SC\;\left( {G_{1} } \right) & = 2\sqrt {\frac{1}{1 + 2}} + 7\sqrt {\frac{1}{1 + 3}} + 2\sqrt {\frac{1}{2 + 2}} + 12\sqrt {\frac{1}{2 + 3}} + 10\sqrt {\frac{1}{3 + 3}} \\ & = 2\sqrt {\frac{1}{3}} + 7\sqrt {\frac{1}{4}} + 2\sqrt {\frac{1}{4}} + 12\sqrt {\frac{1}{5}} + 10\sqrt {\frac{1}{6}} = 15.1037. \\ \end{aligned}$$
  8. i)

    By using Geometric Arithmetic Index

    $$GA\;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} 2\frac{{\sqrt {dr \times ds} }}{dr + ds},$$
    $$\begin{aligned} GA\;\left( {G_{1} } \right) & = 2 \times 2\frac{{\sqrt {1 \times 2} }}{1 + 2} + 2 \times 7\frac{{\sqrt {1 \times 3} }}{1 + 3} + 2 \times 2\frac{{\sqrt {2 \times 2} }}{2 + 2} + 2 \times 12\frac{{\sqrt {2 \times 3} }}{2 + 3} + 2 \times 10\frac{{\sqrt {3 \times 3} }}{3 + 3} \\ & = 4\frac{\sqrt 2 }{3} + 14\frac{\sqrt 3 }{4} + 4\frac{\sqrt 4 }{4} + 24\frac{\sqrt 6 }{5} + 20\frac{\sqrt 9 }{6} = 31.7053. \\ \end{aligned}$$
  9. j)

    By using Hyper Zagreb Index

    $$HZ\;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \left( {dr + ds} \right)^{2} ,$$
    $$\begin{aligned} HZ\;\left( {G_{1} } \right) & = 2\left( {1 + 2} \right)^{2} + 7\left( {1 + 3} \right)^{2} + 2\left( {2 + 2} \right)^{2} + 12\left( {2 + 3} \right)^{2} + 10\left( {3 + 3} \right)^{2} \\ & = 2\left( 3 \right)^{2} + 7\left( 4 \right)^{2} + 2\left( 4 \right)^{2} + 12\left( 5 \right)^{2} + 10\left( 6 \right)^{2} = 822. \\ \end{aligned}$$
  10. k)

    By using Redefined First Zagreb Index

    $$ReZ_{1} \;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \frac{{\left( {dr \times ds} \right)}}{{\left( {dr + ds} \right)}},$$
    $$\begin{aligned} ReZ_{1} \;\left( {G_{1} } \right) & = 2 \times \frac{1 \times 2}{1 + 2} + 7 \times \frac{1 \times 3}{1 + 3} + 2 \times \frac{2 \times 2}{2 + 2} + 12 \times \frac{2 \times 3}{2 + 3} + 10 \times \frac{3 \times 3}{3 + 3} \\ & = 2 \times \frac{2}{3} + 7 \times \frac{3}{4} + 2 \times \frac{4}{4} + 12 \times \frac{6}{5} + 10 \times \frac{9}{6} = 37.9833. \\ \end{aligned}$$
  11. l)

    By using Redefined Second Zagreb Index

    $$ReZ_{2} \;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \left( {dr \times ds} \right)\left( {dr + ds} \right),$$
    $$\begin{aligned} ReZ_{2} \;\left( {G_{1} } \right) & = 2\left( {1 \times 2} \right)\left( {1 + 2} \right) + 7\left( {1 \times 3} \right)\left( {1 + 3} \right) + 2\left( {2 \times 2} \right)\left( {2 + 2} \right) + 12\left( {2 \times 3} \right)\left( {2 + 3} \right) + 10\left( {3 \times 3} \right)\left( {3 + 3} \right) \\ & = 2 \times 2 \times 3 + 7 \times 3 \times 4 + 2 \times 4 \times 4 + 12 \times 6 \times 5 + 10 \times 9 \times 6 = 1028. \\ \end{aligned}$$
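The proof gives no step for the ABC value stated in item (f) of Theorem 1. Assuming the standard definition of the atom-bond connectivity index (an assumption here, since the formula is not reproduced in this section), the same edge partition gives the following worked check, which agrees with the stated value 23.695 up to rounding in the last digit:

$$ABC\;\left( G \right) = \mathop \sum \limits_{rs \in E\left( G \right)} \sqrt {\frac{dr + ds - 2}{dr \times ds}} ,$$
$$\begin{aligned} ABC\;\left( {G_{1} } \right) & = 2\sqrt {\frac{1 + 2 - 2}{1 \times 2}} + 7\sqrt {\frac{1 + 3 - 2}{1 \times 3}} + 2\sqrt {\frac{2 + 2 - 2}{2 \times 2}} + 12\sqrt {\frac{2 + 3 - 2}{2 \times 3}} + 10\sqrt {\frac{3 + 3 - 2}{3 \times 3}} \\ & = 2\sqrt {\frac{1}{2}} + 7\sqrt {\frac{2}{3}} + 2\sqrt {\frac{2}{4}} + 12\sqrt {\frac{3}{6}} + 10\sqrt {\frac{4}{9}} \approx 23.696. \\ \end{aligned}$$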

Remark 3.2

The topological indices of other drugs can be obtained using a similar technique as that used in Theorem 1 and their output is provided in Table 2.

Table 2 The topological indices values for the candidate drugs

Although many scholars already calculate topological indices [43,44,45,46], we contribute an efficient Python program (see Algorithm 1) to compute these indices. In particular, our technique computes them quickly by taking the edge partition values of each molecular graph as input. This Python method offers a simplified procedure, improved accuracy and time savings for computing topological indices.

Algorithm 1

Theorem 1 and Algorithm 1 can both be used to compute topological indices; however, the algorithmic approach is more efficient. Moreover, Table 3 shows the physicochemical properties of the selected drugs, collected from ChemSpider [47] and PubChem [48], alongside the TIs computed from their molecular structures with the Python algorithm described above.
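Since Algorithm 1 is not reproduced in the text, the following is a minimal sketch of how such a program might look, not the authors' code. It takes the edge partition of elvitegravir from the proof of Theorem 1 and evaluates each index with the per-edge formulas used there; the names `ELVITEGRAVIR`, `INDEX_TERMS` and `topological_indices` are hypothetical, and the ABC term assumes the standard atom-bond connectivity definition.

```python
import math

# Edge partition of elvitegravir as given in the proof of Theorem 1:
# key (r, s) = degrees of the end vertices, value = number of such edges.
ELVITEGRAVIR = {(1, 2): 2, (1, 3): 7, (2, 2): 2, (2, 3): 12, (3, 3): 10}

# Contribution of a single edge with end-vertex degrees r and s to each index.
INDEX_TERMS = {
    "M1": lambda r, s: r + s,
    "M2": lambda r, s: r * s,
    "H": lambda r, s: 2 / (r + s),
    "F": lambda r, s: r**2 + s**2,
    "SS": lambda r, s: math.sqrt(r * s / (r + s)),
    "ABC": lambda r, s: math.sqrt((r + s - 2) / (r * s)),  # assumed standard definition
    "RI": lambda r, s: 1 / math.sqrt(r * s),
    "SC": lambda r, s: 1 / math.sqrt(r + s),
    "GA": lambda r, s: 2 * math.sqrt(r * s) / (r + s),
    "HZ": lambda r, s: (r + s) ** 2,
    "ReZ1": lambda r, s: r * s / (r + s),
    "ReZ2": lambda r, s: r * s * (r + s),
}


def topological_indices(partition):
    """Sum each per-edge term over the partition, weighted by the edge counts."""
    return {
        name: sum(count * term(r, s) for (r, s), count in partition.items())
        for name, term in INDEX_TERMS.items()
    }


indices = topological_indices(ELVITEGRAVIR)
for name, value in indices.items():
    print(f"{name}: {value:.4f}")
```

Keeping the per-edge terms in one dictionary means a further index can be added with a single line, and the same function can be applied to the edge partition of any of the other drugs.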

Table 3 The properties of drugs related to their physical characteristics

Supervised machine learning

Within the field of artificial intelligence, machine learning focuses on creating statistical models and algorithms that allow computers to learn and make decisions without explicit programming. Drug development commonly involves machine learning techniques such as the Random Forest Algorithm (RFA), Extreme Gradient Boosting (XGB), and linear analysis. Linear techniques such as linear regression are helpful for simpler, easier-to-understand models, whereas ensemble learning techniques such as XGB and RFA can handle complex nonlinear correlations and interactions in data.

Random forest

For machine learning tasks including regression, RFA is a potent ensemble learning technique. During training, it builds a large number of decision trees, and for regression it outputs the mean of the predictions of the individual trees. To begin, RF uses bootstrapping to draw many random samples from the training set. A decision tree is trained on each such subset, also referred to as a bootstrap sample, and at every split point only a random subset of features is considered. This randomness decorrelates the trees, which improves the overall performance of the model. Each tree is grown to its full depth without pruning. Once every tree is constructed, the Random Forest algorithm combines their predictions. For regression, the prediction formula is:

$${\text{Y}}^{\prime} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} y_{i} ,$$

where Y′ is the predicted output, y1, y2, …, yn are the predicted outputs of the individual decision trees, and n is the total number of trees in the Random Forest. Figure 4 represents the feature importance of some physicochemical properties w.r.t. the topological indices; Figs. 5 and 6 illustrate the decision trees.
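The bootstrap-and-average procedure can be illustrated with a deliberately simplified sketch in plain Python. It is not the scikit-learn model used in the study: each "tree" here is only a depth-1 split (a stump) rather than a fully grown tree, one random feature is drawn per tree, and the feature rows and target values are hypothetical numbers chosen for illustration.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Fit a depth-1 regression tree on one feature by least squares."""
    values = sorted({row[feature] for row in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
        right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # the feature is constant in this bootstrap sample
        sample_mean = mean(targets)
        return feature, float("inf"), sample_mean, sample_mean
    return feature, best[1], best[2], best[3]


def predict_tree(tree, row):
    feature, threshold, left_mean, right_mean = tree
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        feature = rng.randrange(n_features)            # random feature for this tree's split
        forest.append(fit_stump([rows[i] for i in sample], [targets[i] for i in sample], feature))
    return forest


def predict_forest(forest, row):
    """Y' = (1/n) * sum of the individual tree predictions."""
    return mean(predict_tree(tree, row) for tree in forest)


# Hypothetical (index_1, index_2) features and property values, for illustration only.
rows = [(100, 120), (130, 150), (162, 195), (200, 240), (260, 310)]
targets = [250.0, 330.0, 420.0, 520.0, 670.0]
forest = fit_forest(rows, targets)
print(predict_forest(forest, (162, 195)))
```

Because the final prediction is an average of leaf means, it always lies between the smallest and largest target in the training data, which is one reason random forests cannot extrapolate beyond the observed range.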

Fig. 4 Graphical representation of feature importance of MW and Den w.r.t TIs

Fig. 5 Decision trees for BP

Fig. 6 Random forest algorithm based violin distribution plot

Violin plots highlight gaps in the data distribution and help evaluate the accuracy of predictions against actual values graphically as shown in Figs. 7, 8 and 9. RFA output error measures are shown in Table 4 and include specific parameters like Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The following formulas can be used to determine MAE, MSE, and RMSE:

  • \({\text{MAE}} = \frac{1}{{\text{n}}}\sum \left| {{\text{actual}} - {\text{predicted}}} \right|,\)

  • \({\text{MSE}} = \frac{1}{{\text{n}}}\sum \left( {{\text{actual}} - {\text{predicted}}} \right)^{2} ,\)

  • \({\text{RMSE}} = \sqrt {\frac{1}{{\text{n}}}\sum \left( {{\text{actual}} - {\text{predicted}}} \right)^{2} } .\)
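The three error measures can be computed directly, as in the short sketch below. It assumes the usual definition in which RMSE is the square root of MSE; the function names and the sample values are hypothetical and are not taken from Table 4.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, i.e. the square root of MSE."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

RMSE is expressed in the same units as the property being predicted, which makes it easier to compare against the property values than MSE.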

Fig. 7 Decision trees for Den

Fig. 8 Random forest algorithm based violin distribution plot

Fig. 9 Random forest algorithm based violin distribution plot

Table 4 Random forest error measurement

The random forest algorithm’s performance and prediction accuracy were examined through information gained from both the violin plots and tables.

Linear regression

Linear regression is a fundamental supervised machine learning technique that models the relationship between a dependent variable and one or more independent variables. These models quantify the relationship between drug structures and their properties using descriptors such as TIs. The QSPR results are given by the regression equation P = X + Y (TI), where P is the physicochemical property of a drug, TI is the topological index, X is the constant (intercept) and Y is the regression coefficient. The correlation coefficients between each topological index and the nine physicochemical properties are shown in Table 5, and a bar graph of these correlation coefficients across the different topological indices is shown in Fig. 10. The linear regression equations for the physicochemical properties w.r.t. the TIs are given below.
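As a sketch of how the constant X, the coefficient Y and the correlation coefficient r in P = X + Y (TI) can be obtained, the code below implements ordinary least squares for a single index in plain Python. The function names and the small data set are hypothetical (the study itself reports using SPSS); the last lines evaluate the reported molecular weight model for M1 at the value M1(G1) = 162 from Theorem 1.

```python
import math
from statistics import mean


def fit_line(ti, prop):
    """Ordinary least squares for prop = X + Y * ti; returns (X, Y)."""
    ti_bar, p_bar = mean(ti), mean(prop)
    sxx = sum((t - ti_bar) ** 2 for t in ti)
    sxy = sum((t - ti_bar) * (p - p_bar) for t, p in zip(ti, prop))
    slope = sxy / sxx
    return p_bar - slope * ti_bar, slope


def pearson_r(ti, prop):
    """Pearson correlation coefficient between an index and a property."""
    ti_bar, p_bar = mean(ti), mean(prop)
    sxx = sum((t - ti_bar) ** 2 for t in ti)
    syy = sum((p - p_bar) ** 2 for p in prop)
    sxy = sum((t - ti_bar) * (p - p_bar) for t, p in zip(ti, prop))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical, exactly linear data used only to check the functions.
ti = [1.0, 2.0, 3.0, 4.0]
prop = [3.0, 5.0, 7.0, 9.0]
X, Y = fit_line(ti, prop)
r = pearson_r(ti, prop)

# Reported model "Molecular weight = 20.3377 + 2.4977[M1(G)]" at M1(G1) = 162.
mw_from_m1 = 20.3377 + 2.4977 * 162
print(X, Y, r, mw_from_m1)
```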

Table 5 Correlation coefficients of TIs w.r.t. different physiochemical properties

Fig. 10 Correlation coefficients of all physio-chemical properties w.r.t TIs

Linear regression models w.r.t TIs

Regression model for M2(G)

Molecular weight = 22.1100 + 2.1165[M2(G)]

Complexity = 0.2808 + 3.5990[M2(G)]

Density = 1.7482 − 0.0016[M2(G)]

Flash point = 46.9325 + 1.5793[M2(G)]

Molar volume = − 45.305 + 1.8798[M2(G)]

Surface tension = 79.4937 − 0.0705[M2(G)]

Polarizability = − 1.6054 + 0.2440[M2(G)]

Boiling point = 154.073 + 2.6109[M2(G)]

Enthalpy of vaporization = 32.0454 + 0.3643[M2(G)]

Regression model for M1(G)

Molecular weight = 20.3377 + 2.4977[M1(G)]

Complexity = 16.339 + 4.1416[M1(G)]

Density = 1.7593 − 0.0020[M1(G)]

Flash point = 45.2833 + 1.8655[M1(G)]

Molar volume = − 50.941 + 2.2409[M1(G)]

Surface tension = 79.4230 − 0.0825[M1(G)]

Polarizability = − 2.3993 + 0.2913[M1(G)]

Boiling point = 151.346 + 3.0842[M1(G)]

Enthalpy of vaporization = 31.5778 + 0.4309[M1(G)]

Regression model for F(G)

Molecular weight = 56.1270 + 0.8636[F(G)]

Complexity = 42.5485 + 1.5009[F(G)]

Density = 1.7062 − 0.0006[F(G)]

Flash point = 68.2462 + 0.6529[F(G)]

Molar volume = − 11.6411 + 0.7598[F(G)]

Surface tension = 77.8286 − 0.0277[F(G)]

Polarizability = 2.6997 + 0.0987[F(G)]

Boiling point = 189.3117 + 1.0793[F(G)]

Enthalpy of vaporization = 36.8359 + 0.1509[F(G)]

Regression model for H(G)

Molecular weight = 14.0675 + 29.5447[H(G)]

Complexity = 50.3265 + 46.1198[H(G)]

Density = 1.79322 − 0.0254[H(G)]

Flash point = 44.9218 + 21.7875[H(G)]

Molar volume = − 66.0938 + 27.1224[H(G)]

Surface tension = 80.3293 − 1.0205[H(G)]

Polarizability = − 4.2840 + 3.5197[H(G)]

Boiling point = 150.7451 + 36.0198[H(G)]

Enthalpy of vaporization = 31.3317 + 5.0423[H(G)]

Regression model for ABC(G)

Molecular weight = 20.3827 + 16.9418[ABC(G)]

Complexity = 37.6722 + 27.2928[ABC(G)]

Density = 1.7710 − 0.0140[ABC(G)]

Flash point = 46.5920 + 12.6059[ABC(G)]

Molar volume = − 55.2781 + 15.3641[ABC(G)]

Surface tension = 79.6061 − 0.5662[ABC(G)]

Polarizability = − 2.9516 + 1.9965[ABC(G)]

Boiling point = 153.5087 + 20.8404[ABC(G)]

Enthalpy of vaporization = 31.769 + 2.915[ABC(G)]

Regression model for SS(G)

Molecular weight = 9.7915 + 11.7757[SS(G)]

Complexity = 16.8938 + 19.0652[SS(G)]

Density = 1.7819 − 0.0098[SS(G)]

Flash point = 39.4157 + 8.7440[SS(G)]

Molar volume = − 64.516 + 10.6698[SS(G)]

Surface tension = 80.0087 − 0.3948[SS(G)]

Polarizability = − 4.1564 + 1.3866[SS(G)]

Boiling point = 141.6433 + 14.4558[SS(G)]

Enthalpy of vaporization = 30.2478 + 2.0188[SS(G)]

Regression model for SC(G)

Molecular weight = 13.3437 + 27.1882[SC(G)]

Complexity = 39.4253 + 43.0217[SC(G)]

Density = 1.7876 − 0.0230[SC(G)]

Flash point = 43.2542 + 20.1170[SC(G)]

Molar volume = − 64.6865 + 24.836[SC(G)]

Surface tension = 80.1367 − 0.9262[SC(G)]

Polarizability = − 4.1379 + 3.2252[SC(G)]

Boiling point = 147.989 + 33.2581[SC(G)]

Enthalpy of vaporization = 30.9983 + 4.6526[SC(G)]

Regression model for RI(G)

Molecular weight = 16.2195 + 28.1100[RI(G)]

Complexity = 47.5607 + 44.2587[RI(G)]

Density = 1.7843 − 0.0238[RI(G)]

Flash point = 45.7931 + 20.7737[RI(G)]

Molar volume = − 62.3854 + 25.6982[RI(G)]

Surface tension = 80.1996 − 0.9676[RI(G)]

Polarizability = − 3.7882 + 3.3340[RI(G)]

Boiling point = 152.1868 + 34.3436[RI(G)]

Enthalpy of vaporization = 31.4694 + 4.8117[RI(G)]

Regression model for HZ(G)

Molecular weight = 123.0392 + 0.3994[HZ(G)]

Complexity = 170.448 + 0.6808[HZ(G)]

Density = 1.6200 − 0.0003[HZ(G)]

Flash point = 128.5374 + 0.2908[HZ(G)]

Molar volume = 52.1288 + 0.3458[HZ(G)]

Surface tension = 76.9305 − 0.0142[HZ(G)]

Polarizability = 11.1992 + 0.0447[HZ(G)]

Boiling point = 288.9813 + 0.4807[HZ(G)]

Enthalpy of vaporization = 49.5430 + 0.0686[HZ(G)]

Regression model for GA(G)

Molecular weight = 9.4468 + 13.0085[GA(G)]

Complexity = 28.0960 + 20.7296[GA(G)]

Density = 1.7901 − 0.0110[GA(G)]

Flash point = 40.1194 + 9.6328[GA(G)]

Molar volume = − 67.435 + 11.8602[GA(G)]

Surface tension = 80.1620 − 0.4401[GA(G)]

Polarizability = − 4.5259 + 1.5410[GA(G)]

Boiling point = 142.8059 + 15.9243[GA(G)]

Enthalpy of vaporization = 30.377 + 2.2248[GA(G)]

Regression model for ReZ2(G)

Molecular weight = 57.3870 + 0.3780[ReZ2(G)]

Complexity = 29.1286 + 0.6712[ReZ2(G)]

Density = 1.6995 − 0.0003[ReZ2(G)]

Flash point = 71.4891 + 0.2836[ReZ2(G)]

Molar volume = − 7.2151 + 0.3295[ReZ2(G)]

Surface tension = 78.3134 − 0.0126[ReZ2(G)]

Polarizability = 3.4246 + 0.0427[ReZ2(G)]

Boiling point = 194.6729 + 0.4689[ReZ2(G)]

Enthalpy of vaporization = 37.6498 + 0.0655[ReZ2(G)]

Regression model for ReZ1(G)

Molecular weight = 6.6707 + 11.0317[ReZ1(G)]

Complexity = 3.7703 + 18.0524[ReZ1(G)]

Density = 1.7808 − 0.0091[ReZ1(G)]

Flash point = 37.1349 + 8.1906[ReZ1(G)]

Molar volume = − 65.6712 + 9.9559[ReZ1(G)]

Surface tension = 80.1156 − 0.3699[ReZ1(G)]

Polarizability = − 4.2984 + 1.2936[ReZ1(G)]

Boiling point = 137.8725 + 13.5410[ReZ1(G)]

Enthalpy of vaporization = 29.7872 + 1.8895[ReZ1(G)]

Computation of statistical parameters

Statistical parameters are useful for comparing topological indices (TIs) and their correlation coefficients in model analysis. In a regression model, the standard error (SE) of the estimate measures the typical deviation of the predicted values from the actual values. Tables 6, 7 and 8 show the SE, F-statistics and significance p-values. Furthermore, the comparison graphs in Figs. 11, 12, 13, 14, 15, 16, 17, 18 and 19 show both the actual physicochemical property values and those computed from the regression models.
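For a simple linear regression with one index as predictor, the standard error of the estimate and the F-statistic can be derived from the residual and regression sums of squares. The sketch below assumes the textbook definitions SE = sqrt(SSE / (n − 2)) and F = SSR / (SSE / (n − 2)); it is not the SPSS output used in the study, and the function name and data are hypothetical.

```python
import math
from statistics import mean


def regression_stats(x, y):
    """Return (standard error of the estimate, F-statistic) for y = a + b * x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * a for a in x]
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))  # residual sum of squares
    ssr = sum((f - y_bar) ** 2 for f in fitted)         # regression sum of squares
    se = math.sqrt(sse / (n - 2))
    f_stat = ssr / (sse / (n - 2))
    return se, f_stat


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
se, f_stat = regression_stats(x, y)
print(se, f_stat)
```

The significance p-value is the upper-tail probability of the F distribution with (1, n − 2) degrees of freedom at the computed F; it is omitted here because the Python standard library has no F-distribution function.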

Table 6 Statistical parameter SE of selected TIs w.r.t. different physiochemical properties

Table 7 Statistical parameter F of selected TIs w.r.t. different physiochemical properties

Table 8 Statistical parameter P of selected TIs w.r.t. different physiochemical properties

Fig. 11 Graphical comparison w.r.t linear regression for MW

Fig. 12 Graphical comparison w.r.t linear regression for Comp

Fig. 13 Graphical comparison w.r.t linear regression for Den

Fig. 14 Graphical comparison w.r.t linear regression for FP

Fig. 15 Graphical comparison w.r.t linear regression for MV

Fig. 16 Graphical comparison w.r.t linear regression for ST

Fig. 17 Graphical comparison w.r.t linear regression for Pol

Fig. 18 Graphical comparison w.r.t linear regression for BP

Fig. 19 Graphical comparison w.r.t linear regression for EV

Additionally, the majority of p-values are less than 0.05, and r mostly exceeds 0.6, as seen in Tables 5 and 8.

Extreme gradient boosting

Extreme Gradient Boosting (XGB) is a powerful machine learning method that is well known for its efficiency in predictive modeling. The pseudo-code in Algorithm 2 summarizes how XGB was applied and illustrates its flexibility and adaptability. The distribution plots of the actual and predicted values are shown in Figs. 20, 21 and 22, which are essential for evaluating the effectiveness of the model and detecting any deviations. The violin plots further aid understanding by displaying the data distribution graphically while highlighting the behavior specific to XGB. Table 9 also reports error estimates, which supports a comprehensive review of the model’s predictive power and general accuracy. A well-organized overview of the implementation procedure, such as the one provided by the pseudo-code, helps to expedite the process and improve understanding of its details.

Fig. 20 XGB algorithm based violin distribution plot of MW, Comp and Den

Fig. 21 XGB algorithm based violin distribution plot of FP, MV and ST

Fig. 22 XGB algorithm based violin distribution plot of Pol, BP and EV

Table 9 XGB error measurement

Algorithm 2 XGB for QSPR model of anti-HIV
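The body of Algorithm 2 is not reproduced in the text. As a rough illustration of the boosting idea it refers to, where each new tree corrects the errors of the trees before it, the sketch below implements plain gradient boosting with squared loss and depth-1 trees on a single feature. It is not XGBoost itself (it omits regularization, second-order gradient information and the library's other refinements), and all names and data values are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Least-squares depth-1 regression tree on a single feature."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # assumes at least two distinct feature values


def fit_boosting(xs, ys, n_rounds=100, learning_rate=0.1):
    base = mean(ys)  # initial prediction: the mean of the targets
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left_mean if x <= threshold else right_mean
        for threshold, left_mean, right_mean in stumps
    )


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical index values and property values, for illustration only.
xs = [100, 130, 162, 200, 260]
ys = [250.0, 330.0, 420.0, 520.0, 670.0]
model = fit_boosting(xs, ys)
initial_mse = mse(ys, [mean(ys)] * len(ys))
final_mse = mse(ys, [predict(model, x) for x in xs])
print(initial_mse, final_mse)
```

The learning rate shrinks each tree's correction, so the training error falls gradually over the rounds rather than in one step; with only a handful of samples, as here, a low training error says little about accuracy on unseen compounds.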

Physio-chemical parameters comparison analysis

When XGB and RFA were used to forecast the physio-chemical properties of anti-HIV medicines, the results showed that XGB predictions consistently produced higher values than RFA. This implies that when it comes to the physio-chemical characteristics of anti-HIV drugs, the XGB algorithm typically yields more optimistic forecasts.

Although these two machine learning models provide insightful information about the structure–activity relationship of the associated drugs, the difference in predicted values emphasizes how important it is to consider a variety of computational strategies and validation methods in order to ensure the precision and reliability of predictions made during the drug discovery and development process. Tables 10 and 11 list the drug properties predicted by RFA and XGB, respectively, and Figs. 23, 24, 25, 26 and 27 give a graphical comparison between XGB and RFA.

Table 10 Drug properties predicted by the RFA
Table 11 Drug properties predicted by the XGB
Fig. 23 Graphical comparison of MW

Fig. 24 Graphical comparison of Den

Fig. 25 Graphical comparison of MV

Fig. 26 Graphical comparison of Pol

Fig. 27 Graphical comparison of EV

Standard error measures such as MAE, MSE, and RMSE were used to evaluate the performance of the RFA and XGB models. Tables and graphs were used to compare the error indicators and to assess the relative efficiency of the models. XGB was more accurate than RFA, as shown by its lower MAE, MSE, and RMSE values. Moreover, the higher R2 values obtained for XGB indicate a better fit of the model to the data. The graphical representations and error tables make it easier to see why XGB is such a strong algorithm for predictive modeling problems.
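
To make the four measures concrete, the following Python sketch computes MAE, MSE, RMSE and R2 from paired lists of actual and predicted values using their standard textbook definitions. The function name `regression_errors` and the example numbers are illustrative assumptions, not values from Tables 9 to 11.

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, MSE, RMSE and R2 for paired actual/predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # root mean squared error
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)     # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    # R2 is undefined when all actual values are identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(regression_errors([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

Lower MAE, MSE and RMSE indicate smaller prediction errors, while an R2 closer to 1 indicates that the model explains more of the variance in the actual values.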

Conclusions

Our analysis provides information on the potential efficacy of the drugs under examination in treating HIV-1 disease. To predict physio-chemical properties, we compared the predictive ability of RFA, Linear Regression, and XGB, and assessed their performance using MAE, MSE, RMSE, and R2. XGB performed best, with substantially lower error rates and higher R2 values than the other models, and the graphical representations further illustrated its effectiveness. These findings have important implications for drug development, particularly for the treatment of HIV: machine learning algorithms such as XGB can make the prediction of drug properties more efficient. The advantage of XGB derives from its iterative refinement of predictions. Future studies may investigate additional techniques and data-set optimization. This research contributes to larger-scale predictive modeling efforts in the pharmaceutical industry, and the possibilities of predictive modeling will grow as machine learning techniques develop further. Overall, this work shows that advanced algorithms can be used to improve the drug development process.

Data availability

No datasets were generated or analysed during the current study.

References

  1. Khan MM, Khan MM. Acquired immune deficiency syndrome. In: Immunopharmacology. Cham: Springer; 2016. p. 293–330.
  2. Sellier P, et al. Updated mortality and causes of death in 2020–2021 in people with HIV: a multicenter study in France. AIDS. 2023;37(13):2007–13.
  3. Okoye AA, Picker LJ. CD4+ T-cell depletion in HIV infection: mechanisms of immunological failure. Immunol Rev. 2013;254(1):54–64.
  4. Paiardini M, Müller-Trutwin M. HIV-associated chronic immune activation. Immunol Rev. 2013;254(1):78–101.
  5. Veazey RS. Intestinal CD4 depletion in HIV/SIV infection. Curr Immunol Rev. 2019;15(1):76–91.
  6. Wilson NL, et al. Identifying symptom patterns in people living with HIV disease. J Assoc Nurses AIDS Care. 2016;27(2):121–32.
  7. Joseph SB, et al. HIV-1 target cells in the CNS. J Neurovirol. 2015;21:276–89.
  8. Hu L, et al. Dual-channel hypergraph convolutional network for predicting herb–disease associations. Brief Bioinform. 2024;25(2): bbae067.
  9. Zhao B-W, et al. Motif-aware miRNA-disease association prediction via hierarchical attention network. IEEE J Biomed Health Inform. 2024;28(7):4281–94.
  10. Zhao B-W, et al. iGRLDTI: an improved graph representation learning method for predicting drug–target interactions over heterogeneous biological information network. Bioinformatics. 2023;39(8): btad451.
  11. Zhao B-W, et al. A geometric deep learning framework for drug repositioning over heterogeneous information networks. Brief Bioinform. 2022;23(6): bbac384.
  12. Lv Q, et al. TCMBank: bridges between the largest herbal medicines, chemical ingredients, target proteins, and associated diseases with intelligence text mining. Chem Sci. 2023;14(39):10684–701.
  13. Lv Q, et al. TCMBank-the largest TCM database provides deep learning-based Chinese-Western medicine exclusion prediction. Signal Transduct Target Ther. 2023;8(1):127.
  14. Lv Q, et al. Meta learning with graph attention networks for low-data drug discovery. IEEE Trans Neural Netw Learn Syst. 2023;35(8):11218–30.
  15. Lv Q, et al. Meta-molnet: a cross-domain benchmark for few examples drug discovery. IEEE Trans Neural Netw Learn Syst. 2024. https://doi.org/10.1109/TNNLS.2024.335965.
  16. Lv Q, et al. Mol2Context-vec: learning molecular representation from context awareness for drug discovery. Brief Bioinform. 2021;22(6): bbab317.
  17. Lv Q, et al. 3D graph neural network with few-shot learning for predicting drug–drug interactions in scaffold-based cold start scenario. Neural Netw. 2023;165:94–105.
  18. Ahmed W, et al. A python based algorithmic approach to optimize sulfonamide drugs via mathematical modeling. Sci Rep. 2024;14(1):12264.
  19. Zaman S, et al. On neighborhood eccentricity-based topological indices with QSPR analysis of PAHs drugs. Meas Interdiscip Res Perspect. 2024. https://doi.org/10.1080/15366367.2024.2329950.
  20. Ahmed W, et al. Molecular insights into anti-Alzheimer’s drugs through predictive modeling using linear regression and QSPR analysis. Modern Phys Lett B. 2024. https://doi.org/10.1142/S0217984924502609.
  21. Zaman S, et al. Mathematical modeling and topological graph description of dominating David derived networks based on edge partitions. Sci Rep. 2023;13(1):15159.
  22. Zaman S, et al. Mathematical analysis and molecular descriptors of two novel metal–organic models with chemical applications. Sci Rep. 2023;13(1):5314.
  23. Aqib M, et al. On topological indices of some chemical graphs. Mol Phys. 2023. https://doi.org/10.1080/00268976.2023.2276386.
  24. Bhatia KS, Gupta AK, Saxena AK. Physicochemical significance of topological indices: importance in drug discovery research. Curr Top Med Chem. 2023;23(29):2735–42.
  25. Zanni R, et al. What place does molecular topology have in today’s drug discovery? Expert Opin Drug Discov. 2020;15(10):1133–44.
  26. Ullah A, Bano Z, Zaman S. Computational aspects of two important biochemical networks with respect to some novel molecular descriptors. J Biomol Struct Dyn. 2024;42(2):791–805.
  27. Ullah A, et al. Predictive potential of K-Banhatti and Zagreb type molecular descriptors in structure–property relationship analysis of some novel drug molecules. J Chin Chem Soc. 2024;71(3):250–76.
  28. Zaman S, et al. Three-dimensional structural modelling and characterization of sodalite material network concerning the irregularity topological indices. J Math. 2023;2023(1):5441426.
  29. Zhang X, et al. The study of curve fitting models to analyze some degree-based topological indices of certain anti-cancer treatment. Chem Pap. 2024;78(2):1055–68.
  30. Meharban S, et al. Molecular structural modeling and physical characteristics of anti-breast cancer drugs via some novel topological descriptors and regression models. Curr Res Struct Biol. 2024;7: 100134.
  31. Patel HM, et al. Quantitative structure–activity relationship (QSAR) studies as strategic approach in drug discovery. Med Chem Res. 2014;23:4991–5007.
  32. Zaman S, et al. QSPR analysis of some novel drugs used in blood cancer treatment via degree based topological indices and regression models. Polycycl Aromat Compd. 2023;44:1–17.
  33. Hakeem A, et al. QSPR analysis of some novel drugs used for cardiovascular diseases through degree-based topological indices and regression models. 2023.
  34. Gutman I, Polansky OE. Mathematical concepts in organic chemistry. Berlin: Springer Science & Business Media; 2012.
  35. Fajtlowicz S. On conjectures of Graffiti-II. Congr Numer. 1987;60:187–97.
  36. Furtula B, Gutman I. A forgotten topological index. J Math Chem. 2015;53(4):1184–90.
  37. Zhao W, et al. Computing SS index of certain dendrimers. J Math. 2021;2021:1–14.
  38. Ashraful Alam M, et al. Degree-based entropy for a non-kekulean benzenoid graph. J Math. 2022;2022:1–12.
  39. Gutman I, Furtula B, Katanić V. Randić index and information. AKCE Int J Graphs Comb. 2018;15(3):307–12.
  40. Farahani MR. On the Randic and sum-connectivity index of nanotubes. Ann West Univ Timisoara-Math Comput Sci. 2013;51(2):39–46.
  41. Shirdel GH, Rezapour H, Sayadi AM. The hyper-zagreb index of graph operations. Iran J Math Chem. 2013;4(2):213–20.
  42. Ranjini P, Lokesha V, Usha A. Relation between phenylene and hexagonal squeeze using harmonic index. Int J Graph Theory. 2013;1(4):116–21.
  43. Havare ÖÇ. Topological indices and QSPR modeling of some novel drugs used in the cancer treatment. Int J Quantum Chem. 2021;121(24): e26813.
  44. Kirmani SAK, Ali P, Azam F. Topological indices and QSPR/QSAR analysis of some antiviral drugs being investigated for the treatment of COVID-19 patients. Int J Quantum Chem. 2021;121(9): e26594.
  45. Gnanaraj LRM, Ganesan D, Siddiqui MK. Topological indices and QSPR analysis of NSAID drugs. Polycycl Aromat Compd. 2023;43(10):9479–95.
  46. Huang L, et al. Topological indices and QSPR modeling of new antiviral drugs for cancer treatment. Polycycl Aromat Compd. 2023;43(9):8147–70.
  47. Pence HE, Williams A. ChemSpider: an online chemical information resource. Washington, DC: ACS Publications; 2010.
  48. Kim S, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019;47(D1):D1102–9.

Acknowledgements

The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-94).

Funding

This research was funded by Taif University, Saudi Arabia (Project No. TU-DSPP-2024-94).

Author information

Contributions

All authors (Wakeel Ahmed, Shahid Zaman, Eizzah Asif, Kashif Ali, Emad E. Mahmoud and Mamo Abebe Asheboss) contributed equally to this manuscript at all stages, from conceptualization to the write-up of the final draft.

Corresponding author

Correspondence to Wakeel Ahmed.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

The manuscript has been approved by all authors and consent for publication has been granted.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ahmed, W., Zaman, S., Asif, E. et al. Exploring the role of topological descriptors to predict physicochemical properties of anti-HIV drugs by using supervised machine learning algorithms. BMC Chemistry 18, 167 (2024). https://doi.org/10.1186/s13065-024-01266-4

  • DOI: https://doi.org/10.1186/s13065-024-01266-4

Keywords