- Research
- Open access
Exploring the role of topological descriptors to predict physicochemical properties of anti-HIV drugs by using supervised machine learning algorithms
BMC Chemistry volume 18, Article number: 167 (2024)
Abstract
To explore the role of topological indices in predicting the physicochemical properties of anti-HIV drugs, this research uses Python-based algorithms to compute topological indices together with machine learning algorithms. Degree-based topological indices are calculated using a Python algorithm, providing important information about the structural behavior of drugs that is essential to their anti-HIV effectiveness. Furthermore, machine learning algorithms analyze the physicochemical properties that correspond to anti-HIV activities, making use of their ability to identify complex trends in large, convoluted datasets. In addition to improving our comprehension of the links between molecular structure and effectiveness, the collaboration between machine learning and QSPR research highlights the potential of computational approaches in drug discovery. This work clarifies the mechanisms underlying anti-HIV effectiveness and demonstrates the advantages of machine learning in assessing drug properties, which paves the way for the design and development of more effective anti-HIV drugs.
Introduction
Human Immunodeficiency Virus (HIV) was first identified in the early 1980s following the appearance of a disease that damages the immune system [1]. The illness was later named Acquired Immunodeficiency Syndrome (AIDS). In 1983–1984, the French scientists Françoise Barré-Sinoussi and Luc Montagnier played an essential role in discovering the virus. HIV has caused a global pandemic that has infected millions of people and killed countless others. Its impact on global health is immense: it threatens human health and also affects economies and healthcare systems around the globe [2]. There are two primary types of HIV: HIV-1, which is common throughout the world, and HIV-2, which is primarily linked to West Africa. Here we focus on HIV-1. HIV-1 targets CD4 cells by binding to their surface receptors, which starts the process of the virus entering the cell and taking control of its functions, as shown in Fig. 1. New viruses are created as a result, and the CD4 cells are ultimately destroyed. This gradual depletion of CD4 cells severely weakens the immune system, sets off a series of immune problems, and reduces the body's capacity to mount effective defenses against infections [1, 3,4,5]. Bodily fluids that can transmit the virus include breast milk, vaginal fluids, rectal fluids, semen and blood. These fluids can spread HIV through risky sexual behavior, through needle sharing among people who inject drugs, and from mother to child during pregnancy, birth, or nursing [6].
The goal of antiviral therapy is to stop HIV-1 replication in order to protect CD4 cell levels and immune system health [7]. A wide variety of drugs are used to treat HIV-1 infection, including Rilpivirine, Nevirapine, Emtricitabine, Delavirdine, Elvitegravir, Ritonavir, Saquinavir, Indinavir, and Bictegravir (referred to as a, b, c, …, i respectively, as shown in Fig. 2, with their molecular graphs represented in Fig. 3). These drugs stop the virus from multiplying and from spreading throughout the body by a number of distinct mechanisms. In doing so, they help regulate HIV levels in the blood, which protects CD4 cells. In the analysis of HIV-1, graph theory provides a fundamental tool, particularly in chemistry and drug development. Embeddings of drugs and diseases through dual-channel networks are characterized in [8,9,10,11]. The bridges between the largest herbal medicines, chemical ingredients, target proteins, and associated diseases, with respect to neural network and deep learning-based invariants, are discussed in [12,13,14,15,16,17].
Graph theory is essential to the analysis of biochemical networks in medicine, including drug–target relationships and protein–protein interactions [18,19,20,21,22]. To aid in identifying possible drug candidates and optimizing drug design, graphs depict drugs as nodes and their interactions with targets as edges. Similarly, protein–protein interaction graphs represent proteins as nodes and their interactions as edges. This makes it possible to identify important protein hubs and pathways that are connected to disease causes and potential treatment approaches. Topological indices (TIs) from graph theory are essential for drug discovery [23,24,25].
Our main goal is to conduct an extensive review of nine selected antiviral drugs for HIV-1. Using a Python algorithm based on graph theory, we compute their degree-based TIs, such as the Randić, Sum Connectivity, First Zagreb and Second Zagreb indices, which are shown in Table 1. Python programs are essential resources for researchers examining the chemical properties of drugs and computing topological indices. In addition to improving analytical efficiency by automating repetitive processes and quickly processing large data sets, the computational approach offers substantial benefits when many drugs are studied simultaneously. Physicochemical characteristics such as molecular weight (MW), complexity (Comp), density (Den), flash point (FP), molar volume (MV), surface tension (ST), polarizability (Pol), boiling point (BP) and enthalpy of vaporization (EV) are integrated into the study through machine learning algorithms. By revealing complex links between molecular descriptors and biological activities, this integration contributes to our understanding of the potential efficacy and safety profiles of drugs against HIV. Combining topological indices with physicochemical parameters is essential to provide a comprehensive understanding of the molecular properties of HIV drugs, as well as insights into their modes of action and potential adverse effects. In order to predict drug efficacy based on molecular features, researchers utilize supervised machine learning models to establish quantitative correlations between calculated molecular descriptors and observed biological activity.
Supervised machine learning predictive models offer valuable insights into the potential efficacy of anti-HIV drugs by analyzing their molecular properties and estimating their effectiveness against the illness. Quantitative Structure–Property Relationship (QSPR) analysis is becoming increasingly important in understanding the relationships between drug structures and biological behavior [26,27,28,29,30]. QSPR analysis provides a rational framework for drug design and optimization [31,32,33]. By combining computational methods and QSPR analysis, researchers hope to obtain a deeper understanding of the molecular mechanisms underlying anti-HIV drugs, which will help in the development of more focused and efficient treatment options.
Material and method
We first determined the edge partition of each molecular graph based on graph connectivity, which is an important step in recognizing structural properties. Degree-based TIs were then calculated by analyzing the degrees of the vertices in the molecular graph. To make this process easier, a dedicated Python algorithm was developed. After that, Python programs were used to implement machine learning methods for the analysis of physicochemical properties. Furthermore, we used the Statistical Package for the Social Sciences (SPSS) software to analyze relationships between the computed indices and experimental properties, and we performed a graphical comparison between actual and computed drug properties to check the accuracy and credibility of our results.
Data acquisition and preparation
- We used Python 3.12 to compute the topological indices and sourced the physicochemical properties from the online databases ChemSpider (https://www.chemspider.com) and PubChem (https://pubchem.ncbi.nlm.nih.gov). The topological descriptors were employed as feature variables (input variables), while the physicochemical properties served as target variables. Our analysis covered a dataset composed of multiple feature variables and target variables, representing a considerable amount of data points.
- Given that our dataset is labeled, we opted for supervised machine learning algorithms, specifically Random Forest (RF) and XGBoost, to analyze the data and derive insights. RF is chosen for its proficiency in handling overfitting through its ensemble approach, where multiple decision trees contribute to a more stable and accurate prediction. By averaging several trees, Random Forest reduces the risk of overfitting, which is common with single decision trees. XGBoost is based on the gradient boosting framework, which builds one tree at a time. Each new tree helps to correct errors made by the previously trained trees.
- The primary libraries utilized for Random Forest and XGBoost are:
  - “pandas” for data manipulation,
  - “numpy” for numerical operations,
  - “scikit-learn” for machine learning algorithms, including Random Forest and XGBoost,
  - “matplotlib” and “seaborn” for data visualization.
- Computational resources: the computations were performed on a machine with an Intel Core i7 processor and 16 GB of RAM.
Results and discussion
Theorem 1
Let G1 denote the molecular graph of elvitegravir. Then the following hold for G1:
(a) M1 (G1) = 162; (b) M2 (G1) = 195; (c) H (G1) = 13.966; (d) F (G1) = 432; (e) SS (G1) = 35.088; (f) ABC (G1) = 23.695; (g) RI (G1) = 14.688; (h) SC (G1) = 15.1037; (i) GA (G1) = 31.705; (j) HZ (G1) = 822; (k) ReZG1 (G1) = 37.983; (l) ReZG2 (G1) = 1028.
Proof
Suppose that elvitegravir is represented by G1, and let Er,s be the set of edges joining vertices of degrees r and s, so that |Er,s| is the number of such edges. For example, |E1,2| = 2 means that two edges join vertices of degree 1 and 2, while |E1,3| = 7 means that seven edges join vertices of degree 1 and 3. Similarly, |E2,2| = 2, |E2,3| = 12 and |E3,3| = 10. In the formulas below, each sum runs over the edges rs of the graph and uses the degrees of the end vertices r and s. Then,
a)
By using the First Zagreb Index
$$M_{1}(G) = \sum_{rs \in E(G)} \left( d_r + d_s \right),$$
$$\begin{aligned} M_{1}(G_{1}) & = 2(1 + 2) + 7(1 + 3) + 2(2 + 2) + 12(2 + 3) + 10(3 + 3) \\ & = 2 \times 3 + 7 \times 4 + 2 \times 4 + 12 \times 5 + 10 \times 6 = 162. \\ \end{aligned}$$
b)
By using the Second Zagreb Index
$$M_{2}(G) = \sum_{rs \in E(G)} \left( d_r \times d_s \right),$$
$$\begin{aligned} M_{2}(G_{1}) & = 2(1 \times 2) + 7(1 \times 3) + 2(2 \times 2) + 12(2 \times 3) + 10(3 \times 3) \\ & = 2 \times 2 + 7 \times 3 + 2 \times 4 + 12 \times 6 + 10 \times 9 = 195. \\ \end{aligned}$$
c)
By using the Harmonic Index
$$H(G) = \sum_{rs \in E(G)} \frac{2}{d_r + d_s},$$
$$\begin{aligned} H(G_{1}) & = 2 \times \frac{2}{1 + 2} + 7 \times \frac{2}{1 + 3} + 2 \times \frac{2}{2 + 2} + 12 \times \frac{2}{2 + 3} + 10 \times \frac{2}{3 + 3} \\ & = 2 \times \frac{2}{3} + 7 \times \frac{2}{4} + 2 \times \frac{2}{4} + 12 \times \frac{2}{5} + 10 \times \frac{2}{6} \approx 13.966. \\ \end{aligned}$$
d)
By using the Forgotten Index
$$F(G) = \sum_{rs \in E(G)} \left[ d_r^{2} + d_s^{2} \right],$$
$$\begin{aligned} F(G_{1}) & = 2(1^{2} + 2^{2}) + 7(1^{2} + 3^{2}) + 2(2^{2} + 2^{2}) + 12(2^{2} + 3^{2}) + 10(3^{2} + 3^{2}) \\ & = 2 \times 5 + 7 \times 10 + 2 \times 8 + 12 \times 13 + 10 \times 18 = 432. \\ \end{aligned}$$
e)
By using the Shilpa-Shanmukha Index
$$SS(G) = \sum_{rs \in E(G)} \sqrt{\frac{d_r \times d_s}{d_r + d_s}},$$
$$\begin{aligned} SS(G_{1}) & = 2\sqrt{\frac{1 \times 2}{1 + 2}} + 7\sqrt{\frac{1 \times 3}{1 + 3}} + 2\sqrt{\frac{2 \times 2}{2 + 2}} + 12\sqrt{\frac{2 \times 3}{2 + 3}} + 10\sqrt{\frac{3 \times 3}{3 + 3}} \\ & = 2\sqrt{\frac{2}{3}} + 7\sqrt{\frac{3}{4}} + 2\sqrt{\frac{4}{4}} + 12\sqrt{\frac{6}{5}} + 10\sqrt{\frac{9}{6}} = 35.088. \\ \end{aligned}$$
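The proof does not write out the atom-bond connectivity (ABC) index listed as item (f) of Theorem 1. As a sketch, assuming the standard definition of the ABC index (an assumption, since the formula is not stated in the text), the same edge partition gives:

$$ABC(G) = \sum_{rs \in E(G)} \sqrt{\frac{d_r + d_s - 2}{d_r \times d_s}},$$
$$\begin{aligned} ABC(G_{1}) & = 2\sqrt{\frac{1 + 2 - 2}{1 \times 2}} + 7\sqrt{\frac{1 + 3 - 2}{1 \times 3}} + 2\sqrt{\frac{2 + 2 - 2}{2 \times 2}} + 12\sqrt{\frac{2 + 3 - 2}{2 \times 3}} + 10\sqrt{\frac{3 + 3 - 2}{3 \times 3}} \\ & = 2\sqrt{\frac{1}{2}} + 7\sqrt{\frac{2}{3}} + 2\sqrt{\frac{2}{4}} + 12\sqrt{\frac{3}{6}} + 10\sqrt{\frac{4}{9}} \approx 23.696, \\ \end{aligned}$$

which agrees with the value 23.695 stated in Theorem 1 up to rounding in the last digit.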
f)
By using the Randic Index
$$RI(G) = \sum_{rs \in E(G)} \sqrt{\frac{1}{d_r \times d_s}},$$
$$\begin{aligned} RI(G_{1}) & = 2\sqrt{\frac{1}{1 \times 2}} + 7\sqrt{\frac{1}{1 \times 3}} + 2\sqrt{\frac{1}{2 \times 2}} + 12\sqrt{\frac{1}{2 \times 3}} + 10\sqrt{\frac{1}{3 \times 3}} \\ & = 2\sqrt{\frac{1}{2}} + 7\sqrt{\frac{1}{3}} + 2\sqrt{\frac{1}{4}} + 12\sqrt{\frac{1}{6}} + 10\sqrt{\frac{1}{9}} = 14.688. \\ \end{aligned}$$
g)
By using the Sum Connectivity Index
$$SC(G) = \sum_{rs \in E(G)} \sqrt{\frac{1}{d_r + d_s}},$$
$$\begin{aligned} SC(G_{1}) & = 2\sqrt{\frac{1}{1 + 2}} + 7\sqrt{\frac{1}{1 + 3}} + 2\sqrt{\frac{1}{2 + 2}} + 12\sqrt{\frac{1}{2 + 3}} + 10\sqrt{\frac{1}{3 + 3}} \\ & = 2\sqrt{\frac{1}{3}} + 7\sqrt{\frac{1}{4}} + 2\sqrt{\frac{1}{4}} + 12\sqrt{\frac{1}{5}} + 10\sqrt{\frac{1}{6}} = 15.1037. \\ \end{aligned}$$
h)
By using the Geometric Arithmetic Index
$$GA(G) = \sum_{rs \in E(G)} \frac{2\sqrt{d_r \times d_s}}{d_r + d_s},$$
$$\begin{aligned} GA(G_{1}) & = 2 \times \frac{2\sqrt{1 \times 2}}{1 + 2} + 7 \times \frac{2\sqrt{1 \times 3}}{1 + 3} + 2 \times \frac{2\sqrt{2 \times 2}}{2 + 2} + 12 \times \frac{2\sqrt{2 \times 3}}{2 + 3} + 10 \times \frac{2\sqrt{3 \times 3}}{3 + 3} \\ & = \frac{4\sqrt{2}}{3} + \frac{14\sqrt{3}}{4} + \frac{4\sqrt{4}}{4} + \frac{24\sqrt{6}}{5} + \frac{20\sqrt{9}}{6} = 31.7053. \\ \end{aligned}$$
i)
By using the Hyper Zagreb Index
$$HZ(G) = \sum_{rs \in E(G)} \left( d_r + d_s \right)^{2},$$
$$\begin{aligned} HZ(G_{1}) & = 2(1 + 2)^{2} + 7(1 + 3)^{2} + 2(2 + 2)^{2} + 12(2 + 3)^{2} + 10(3 + 3)^{2} \\ & = 2(3)^{2} + 7(4)^{2} + 2(4)^{2} + 12(5)^{2} + 10(6)^{2} = 822. \\ \end{aligned}$$
j)
By using the Redefined First Zagreb Index
$$ReZ_{1}(G) = \sum_{rs \in E(G)} \frac{d_r \times d_s}{d_r + d_s},$$
$$\begin{aligned} ReZ_{1}(G_{1}) & = 2 \times \frac{1 \times 2}{1 + 2} + 7 \times \frac{1 \times 3}{1 + 3} + 2 \times \frac{2 \times 2}{2 + 2} + 12 \times \frac{2 \times 3}{2 + 3} + 10 \times \frac{3 \times 3}{3 + 3} \\ & = 2 \times \frac{2}{3} + 7 \times \frac{3}{4} + 2 \times \frac{4}{4} + 12 \times \frac{6}{5} + 10 \times \frac{9}{6} = 37.9833. \\ \end{aligned}$$
k)
By using the Redefined Second Zagreb Index
$$ReZ_{2}(G) = \sum_{rs \in E(G)} \left( d_r \times d_s \right)\left( d_r + d_s \right),$$
$$\begin{aligned} ReZ_{2}(G_{1}) & = 2(1 \times 2)(1 + 2) + 7(1 \times 3)(1 + 3) + 2(2 \times 2)(2 + 2) + 12(2 \times 3)(2 + 3) + 10(3 \times 3)(3 + 3) \\ & = 2 \times 2 \times 3 + 7 \times 3 \times 4 + 2 \times 4 \times 4 + 12 \times 6 \times 5 + 10 \times 9 \times 6 = 1028. \\ \end{aligned}$$
Remark 3.2
The topological indices of other drugs can be obtained using a similar technique as that used in Theorem 1 and their output is provided in Table 2.
Although many scholars already calculate topological indices [43,44,45,46], we contribute an efficient Python program (see Algorithm 1) to compute these indices. In particular, the program computes the indices directly from the edge partition values of each molecular graph. This simplifies the procedure, reduces the risk of manual calculation errors and saves time.
Theorem 1 and Algorithm 1 can both be used to compute topological indices; however, the algorithmic approach is more efficient. Moreover, Table 3 shows the physicochemical properties of the selected drugs collected from ChemSpider [47] and PubChem [48], together with the TIs computed from their molecular structures by the Python algorithm.
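Algorithm 1 itself is not reproduced in this text, so the following is only a minimal sketch of how such a program might look, not the authors' implementation. It takes the edge partition of elvitegravir from the proof of Theorem 1 and sums each index's per-edge term. The names `EDGE_PARTITION`, `INDEX_TERMS`, `edge_partition` and `topological_indices` are hypothetical, and the ABC term assumes the standard atom-bond connectivity definition.

```python
import math
from collections import Counter

# Edge partition of elvitegravir as stated in the proof of Theorem 1:
# (r, s) -> number of edges joining vertices of degree r and degree s.
EDGE_PARTITION = {(1, 2): 2, (1, 3): 7, (2, 2): 2, (2, 3): 12, (3, 3): 10}

# Per-edge term of each index, following the formulas in the proof.
# ABC uses the standard definition (assumption: the text does not state it).
INDEX_TERMS = {
    "M1": lambda r, s: r + s,
    "M2": lambda r, s: r * s,
    "H": lambda r, s: 2 / (r + s),
    "F": lambda r, s: r**2 + s**2,
    "SS": lambda r, s: math.sqrt(r * s / (r + s)),
    "ABC": lambda r, s: math.sqrt((r + s - 2) / (r * s)),
    "RI": lambda r, s: 1 / math.sqrt(r * s),
    "SC": lambda r, s: 1 / math.sqrt(r + s),
    "GA": lambda r, s: 2 * math.sqrt(r * s) / (r + s),
    "HZ": lambda r, s: (r + s) ** 2,
    "ReZ1": lambda r, s: r * s / (r + s),
    "ReZ2": lambda r, s: r * s * (r + s),
}


def edge_partition(edges):
    """Count edges by the (sorted) degrees of their end vertices."""
    degree = Counter(v for edge in edges for v in edge)
    return Counter(tuple(sorted((degree[u], degree[v]))) for u, v in edges)


def topological_indices(partition):
    """Sum each index's per-edge term, weighted by the edge counts."""
    return {
        name: sum(count * term(r, s) for (r, s), count in partition.items())
        for name, term in INDEX_TERMS.items()
    }


indices = topological_indices(EDGE_PARTITION)
for name, value in indices.items():
    print(f"{name:>4} = {value:.4f}")
```

For a molecule given as an edge list rather than a precomputed partition, `topological_indices(edge_partition(edges))` would perform the same computation, with hydrogen-suppressed bonds supplied as vertex pairs.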
Supervised machine learning
Within the field of artificial intelligence, machine learning focuses on creating statistical models and algorithms that allow computers to learn and make decisions without explicit programming. Drug development commonly involves machine learning techniques such as the Random Forest Algorithm (RFA), Extreme Gradient Boosting (XGB), and linear regression. Linear regression is helpful for simpler, easier-to-understand models, whereas ensemble learning techniques like XGB and RFA are capable of managing complex nonlinear correlations and interactions in data.
Random forest
For machine learning tasks including regression, RFA is a potent ensemble learning technique. During training, it builds a large number of decision trees, and for regression it outputs the mean of the predictions of the individual trees. To begin, RF uses bootstrapping, a technique that draws many random samples from the training set. Each such sample is referred to as a bootstrap sample, and one decision tree is trained on each. At every split point of a tree, only a random subset of features is considered. This randomness helps decorrelate the trees, so the model performs better overall. Each tree is grown to its full depth without any pruning. Once every tree is constructed, the Random Forest algorithm combines their predictions. The prediction formula for regression is:
$$Y' = \frac{1}{n}\sum_{i = 1}^{n} y_{i},$$
where Y′ is the predicted output, y1, y2, …, yn are the predicted outputs from the individual decision trees, and n is the total number of trees in the Random Forest. Figure 4 represents the feature importance of some physicochemical properties w.r.t. the topological indices; Figs. 5 and 6 illustrate the decision trees.
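The bootstrap-and-average procedure described above can be illustrated with a small pure-Python sketch. It is a simplification and not the implementation used in the study: each tree is a depth-1 regression stump rather than a fully grown tree, the random feature subset is drawn once per tree rather than at every split, and the data are made up for illustration. All function names are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y, features):
    """Best single-split regression tree (depth 1) over the given features."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[f] <= threshold]
            right = [t for row, t in zip(X, y) if row[f] > threshold]
            left_mean, right_mean = mean(left), mean(right)
            sse = sum((t - left_mean) ** 2 for t in left)
            sse += sum((t - right_mean) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no split possible: predict the sample mean
        overall = mean(y)
        return lambda row: overall
    _, f, threshold, left_mean, right_mean = best
    return lambda row: left_mean if row[f] <= threshold else right_mean


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train one stump per bootstrap sample, each on a random feature subset."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        features = rng.sample(range(len(X[0])), n_features)   # random features
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return trees


def predict_forest(trees, row):
    """Y' = (1/n) * sum of the individual tree predictions."""
    return mean(tree(row) for tree in trees)


# Made-up feature rows (two descriptors each) and target values.
X = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
trees = fit_forest(X, y)
prediction = predict_forest(trees, [5, 50])
print(round(prediction, 3))
```

Because every stump outputs a mean of training targets, the averaged prediction always stays within the range of the observed target values.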
Violin plots highlight gaps in the data distribution and help evaluate the accuracy of predictions against actual values graphically as shown in Figs. 7, 8 and 9. RFA output error measures are shown in Table 4 and include specific parameters like Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The following formulas can be used to determine MAE, MSE, and RMSE:
$$\text{MAE} = \frac{1}{n}\sum \left| \text{actual} - \text{predicted} \right|,$$
$$\text{MSE} = \frac{1}{n}\sum \left( \text{actual} - \text{predicted} \right)^{2},$$
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum \left( \text{actual} - \text{predicted} \right)^{2}}.$$
The random forest algorithm’s performance and prediction accuracy were examined through information gained from both the violin plots and tables.
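As a brief illustration of these error measures, the sketch below computes MAE, MSE and RMSE for a pair of value lists, using the standard definition of RMSE as the square root of MSE. The function name `regression_errors` and the sample numbers are hypothetical and are not taken from Table 4.

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, MSE and RMSE for paired actual and predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Made-up values: the residuals are 1, 0 and -2.
errors = regression_errors([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(errors)
```

RMSE is expressed in the same units as the predicted property, which makes it easier to compare against the property values than MSE.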
Linear regression
Linear regression is a fundamental supervised machine learning technique that models the relationship between a dependent variable and one or more independent variables. These models quantify the relationship between drug structures and their medical impacts through the use of various components, such as TIs. The QSPR results are given by the regression equation P = X + Y (TI), where P is the physicochemical property of a potential drug, TI is the topological index, X is the constant and Y is the regression coefficient. The correlation coefficients between each topological index and the nine physicochemical properties are calculated and shown in Table 5, while a bar graph of the correlation coefficients of all physicochemical properties across the different topological indices is shown in Fig. 10. The linear regression equations of the physicochemical properties w.r.t. the TIs are given below.
Linear regression models w.r.t TIs
Regression model for M2(G) | Regression model for M1(G) |
Molecular weight = 22.1100 + 2.1165[M2(G)] | Molecular weight = 20.3377 + 2.4977[M1(G)] |
Complexity = 0.2808 + 3.5990[M2(G)] | Complexity = 16.339 + 4.1416[M1(G)] |
Density = 1.7482 − 0.0016[M2(G)] | Density = 1.7593 − 0.0020[M1(G)] |
Flash point = 46.9325 + 1.5793[M2(G)] | Flash point = 45.2833 + 1.8655[M1(G)] |
Molar volume = − 45.305 + 1.8798[M2(G)] | Molar volume = − 50.941 + 2.2409[M1(G)] |
Surface tension = 79.4937 − 0.0705[M2(G)] | Surface tension = 79.4230 − 0.0825[M1(G)] |
Polarizability = − 1.6054 + 0.2440[M2(G)] | Polarizability = − 2.3993 + 0.2913[M1(G)] |
Boiling point = 154.073 + 2.6109[M2(G)] | Boiling point = 151.346 + 3.0842[M1(G)] |
Enthalpy of vaporization = 32.0454 + 0.3643[M2(G)] | Enthalpy of vaporization = 31.5778 + 0.4309[M1(G)] |
Regression model for F(G) | Regression model for H(G) |
Molecular weight = 56.1270 + 0.8636[F(G)] | Molecular weight = 14.0675 + 29.5447[H(G)] |
Complexity = 42.5485 + 1.5009[F(G)] | Complexity = 50.3265 + 46.1198[H(G)] |
Density = 1.7062 − 0.0006[F(G)] | Density = 1.79322 − 0.0254[H(G)] |
Flash point = 68.2462 + 0.6529[F(G)] | Flash point = 44.9218 + 21.7875[H(G)] |
Molar volume = − 11.6411 + 0.7598[F(G)] | Molar volume = − 66.0938 + 27.1224[H(G)] |
Surface tension = 77.8286 − 0.0277[F(G)] | Surface tension = 80.3293 − 1.0205[H(G)] |
Polarizability = 2.6997 + 0.0987[F(G)] | Polarizability = − 4.2840 + 3.5197[H(G)] |
Boiling point = 189.3117 + 1.0793[F(G)] | Boiling point = 150.7451 + 36.0198[H(G)] |
Enthalpy of vaporization = 36.8359 + 0.1509[F(G)] | Enthalpy of vaporization = 31.3317 + 5.0423[H(G)] |
Regression model for ABC | Regression model for SS(G) |
Molecular weight = 20.3827 + 16.9418[ABC(G)] | Molecular weight = 9.7915 + 11.7757[SS(G)] |
Complexity = 37.6722 + 27.2928[ABC(G)] | Complexity = 16.8938 + 19.0652[SS(G)] |
Density = 1.7710 − 0.0140[ABC(G)] | Density = 1.7819 − 0.0098[SS(G)] |
Flash point = 46.5920 + 12.6059[ABC(G)] | Flash point = 39.4157 + 8.7440[SS(G)] |
Molar volume = − 55.2781 + 15.3641[ABC(G)] | Molar volume = − 64.516 + 10.6698[SS(G)] |
Surface tension = 79.6061 − 0.5662[ABC(G)] | Surface tension = 80.0087 − 0.3948[SS(G)] |
Polarizability = − 2.9516 + 1.9965[ABC(G)] | Polarizability = − 4.1564 + 1.3866[SS(G)] |
Boiling point = 153.5087 + 20.8404[ABC(G)] | Boiling point = 141.6433 + 14.4558[SS(G)] |
Enthalpy of vaporization = 31.769 + 2.915[ABC(G)] | Enthalpy of vaporization = 30.2478 + 2.0188[SS(G)] |
Regression model for SC | Regression model for RI |
Molecular weight = 13.3437 + 27.1882[SC(G)] | Molecular weight = 16.2195 + 28.1100[RI(G)] |
Complexity = 39.4253 + 43.0217[SC(G)] | Complexity = 47.5607 + 44.2587[RI(G)] |
Density = 1.7876 − 0.0230[SC(G)] | Density = 1.7843 − 0.0238[RI(G)] |
Flash point = 43.2542 + 20.1170[SC(G)] | Flash point = 45.7931 + 20.7737[RI(G)] |
Molar volume = − 64.6865 + 24.836[SC(G)] | Molar volume = − 62.3854 + 25.6982[RI(G)] |
Surface tension = 80.1367 − 0.9262[SC(G)] | Surface tension = 80.1996 − 0.9676[RI(G)] |
Polarizability = − 4.1379 + 3.2252[SC(G)] | Polarizability = − 3.7882 + 3.3340[RI(G)] |
Boiling point = 147.989 + 33.2581[SC(G)] | Boiling point = 152.1868 + 34.3436[RI(G)] |
Enthalpy of vaporization = 30.9983 + 4.6526[SC(G)] | Enthalpy of vaporization = 31.4694 + 4.8117[RI(G)] |
Regression model for HZ | Regression model for GA |
Molecular weight = 123.0392 + 0.3994[HZ(G)] | Molecular weight = 9.4468 + 13.0085[GA(G)] |
Complexity = 170.448 + 0.6808[HZ(G)] | Complexity = 28.0960 + 20.7296[GA(G)] |
Density = 1.6200 − 0.0003[HZ(G)] | Density = 1.7901 − 0.0110[GA(G)] |
Flash point = 128.5374 + 0.2908[HZ(G)] | Flash point = 40.1194 + 9.6328[GA(G)] |
Molar volume = 52.1288 + 0.3458[HZ(G)] | Molar volume = − 67.435 + 11.8602[GA(G)] |
Surface tension = 76.9305 − 0.0142[HZ(G)] | Surface tension = 80.1620 − 0.4401[GA(G)] |
Polarizability = 11.1992 + 0.0447[HZ(G)] | Polarizability = − 4.5259 + 1.5410[GA(G)] |
Boiling point = 288.9813 + 0.4807[HZ(G)] | Boiling point = 142.8059 + 15.9243[GA(G)] |
Enthalpy of vaporization = 49.5430 + 0.0686[HZ(G)] | Enthalpy of vaporization = 30.377 + 2.2248[GA(G)] |
Regression model for ReZ2 | Regression model for ReZ1 |
Molecular weight = 57.3870 + 0.3780[ReZ2(G)] | Molecular weight = 6.6707 + 11.0317[ReZ1(G)] |
Complexity = 29.1286 + 0.6712[ReZ2(G)] | Complexity = 3.7703 + 18.0524[ReZ1(G)] |
Density = 1.6995 − 0.0003[ReZ2(G)] | Density = 1.7808 − 0.0091[ReZ1(G)] |
Flash point = 71.4891 + 0.2836[ReZ2(G)] | Flash point = 37.1349 + 8.1906[ReZ1(G)] |
Molar volume = − 7.2151 + 0.3295[ReZ2(G)] | Molar volume = − 65.6712 + 9.9559[ReZ1(G)] |
Surface tension = 78.3134 − 0.0126[ReZ2(G)] | Surface tension = 80.1156 − 0.3699[ReZ1(G)] |
Polarizability = 3.4246 + 0.0427[ReZ2(G)] | Polarizability = − 4.2984 + 1.2936[ReZ1(G)] |
Boiling point = 194.6729 + 0.4689[ReZ2(G)] | Boiling point = 137.8725 + 13.5410[ReZ1(G)] |
Enthalpy of vaporization = 37.6498 + 0.0655[ReZ2(G)] | Enthalpy of vaporization = 29.7872 + 1.8895[ReZ1(G)] |
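Each model above has the form P = X + Y (TI). The sketch below shows one way such a line and its correlation coefficient could be obtained by ordinary least squares, and how a stated model is evaluated for a given index value. The function names are hypothetical, the fitting data are made up, and this is not the SPSS procedure used in the study. Only the molecular weight model for M1(G) and the value M1(G1) = 162 are taken from the text.

```python
from statistics import mean


def fit_line(ti, prop):
    """Ordinary least squares for prop = X + Y * ti; returns (X, Y, r)."""
    ti_bar, prop_bar = mean(ti), mean(prop)
    sxy = sum((t - ti_bar) * (p - prop_bar) for t, p in zip(ti, prop))
    sxx = sum((t - ti_bar) ** 2 for t in ti)
    syy = sum((p - prop_bar) ** 2 for p in prop)
    slope = sxy / sxx
    intercept = prop_bar - slope * ti_bar
    r = sxy / (sxx * syy) ** 0.5  # Pearson correlation coefficient
    return intercept, slope, r


def predict(intercept, slope, ti_value):
    """Evaluate the regression equation P = X + Y * TI."""
    return intercept + slope * ti_value


# Made-up, exactly linear data: the fit should recover X = 2 and Y = 3.
X_hat, Y_hat, r = fit_line([1, 2, 3, 4], [5, 8, 11, 14])

# Stated model: Molecular weight = 20.3377 + 2.4977[M1(G)], with M1(G1) = 162.
mw_elvitegravir = predict(20.3377, 2.4977, 162)
print(round(mw_elvitegravir, 4))
```

Evaluating the stated model for elvitegravir gives the model's estimate of its molecular weight, which can then be compared with the database value reported in the tables.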
Computation of statistical parameters
Comparing the topological indices (TIs) through statistical parameters, alongside the correlation coefficients, is useful in model analysis. In a regression model, the standard error (SE) of the estimate measures the average deviation of the predicted values from the actual values. Tables 6, 7 and 8 show the SE, F-statistics and significance p-values. Furthermore, the comparison graphs in Figs. 11, 12, 13, 14, 15, 16, 17, 18 and 19 show both the actual physicochemical property values and those computed from the regression models.
Additionally, the majority of p-values are less than 0.05, and r mostly exceeds 0.6, as seen in Table 4.
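For a simple linear regression with one topological index as predictor, the standard error of the estimate and the F-statistic can be computed as in the sketch below. It assumes the usual definitions with n − 2 residual degrees of freedom, uses made-up data rather than values from Tables 6, 7 and 8, and the function name is hypothetical. Converting F to a p-value needs the F distribution, for example from a statistics package, and is not shown.

```python
import math
from statistics import mean


def regression_diagnostics(x, y):
    """Return r, the standard error of the estimate and F for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - y_bar) ** 2 for b in y)
    standard_error = math.sqrt(sse / (n - 2))
    f_statistic = (sst - sse) / (sse / (n - 2))  # one predictor: df = (1, n - 2)
    r = sxy / math.sqrt(sxx * sst)
    return {"r": r, "SE": standard_error, "F": f_statistic}


# Made-up index values and property values.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
stats = regression_diagnostics(x, y)
print({k: round(v, 4) for k, v in stats.items()})
```

With a single predictor, F and r are linked by F = r²(n − 2)/(1 − r²), so a larger correlation coefficient directly implies a larger F-statistic for the same number of drugs.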
Extreme gradient boosting
Extreme Gradient Boosting is a powerful machine learning method that is well known for its efficiency in predictive modeling. The pseudo-code in Algorithm-2 provides useful information about XGB, including its flexibility and adaptability. The distribution plots of the actual and predicted values are shown in Figs. 20, 21 and 22, which are essential for evaluating the effectiveness of the model and detecting any deviations. The violin plot further aids understanding by displaying the data distribution graphically while highlighting the features specific to XGB. Table 9 offers error estimates, which support a comprehensive review of the model's predictive power and general accuracy. When using the XGB algorithm, a well-organized overview of the implementation procedure, such as the one provided by the pseudo-code, helps expedite the process and improves understanding of its complexities.
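Algorithm-2 is not reproduced in this text. As a rough illustration of the underlying idea, the sketch below implements plain gradient boosting with squared-error loss, where each new depth-1 tree is fitted to the residuals of the current ensemble. It is not the XGBoost library and omits the regularization and second-order gradient information that XGBoost adds. The function names, hyperparameter values and data are hypothetical.

```python
from statistics import mean


def fit_stump(x, residuals):
    """Depth-1 regression tree on a single feature, fitted to the residuals."""
    best = None
    values = sorted(set(x))  # assumes at least two distinct feature values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for v, r in zip(x, residuals) if v <= threshold]
        right = [r for v, r in zip(x, residuals) if v > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean


def fit_boosting(x, y, n_rounds=50, learning_rate=0.3):
    """Build trees one at a time; each corrects the errors of the previous ones."""
    base = mean(y)  # initial constant prediction
    trees = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, predictions)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(v) for p, v in zip(predictions, x)]
    return lambda v: base + learning_rate * sum(tree(v) for tree in trees)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up single-descriptor data.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = fit_boosting(x, y)
baseline_mse = mse(y, [mean(y)] * len(y))
boosted_mse = mse(y, [model(v) for v in x])
print(round(baseline_mse, 4), round(boosted_mse, 4))
```

The learning rate shrinks each tree's contribution, so more rounds are needed, but each correction is smaller and less likely to overfit a very small dataset.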
Physio-chemical parameters comparison analysis
When XGB and RFA were used to predict the physicochemical properties of anti-HIV drugs, the XGB predictions were consistently higher than the RFA predictions. This implies that, for the physicochemical characteristics of anti-HIV drugs, the XGB algorithm typically yields more optimistic forecasts.
Even though these two machine learning models provide insightful information about the structure–activity relationship of the associated drugs, the difference in predicted values emphasizes how crucial it is to consider a variety of computational strategies and validation methods in order to ensure the precision and dependability of predictions made during drug discovery and development. Tables 10 and 11 list the actual and predicted values of the physical properties for RFA and XGB, and Figs. 23, 24, 25, 26 and 27 show the graphical comparison between XGB and RFA.
Standard error measurements such as MAE, MSE, and RMSE are used to evaluate the performance of predictive models like RFA and XGB. To evaluate the relative efficiency of the models and compare the error indicators, tables and graphs were used. In terms of prediction accuracy, XGB performed better than RFA, as seen from its lower MAE, MSE, and RMSE values. Furthermore, the greater R2 values for XGB demonstrated a better fit of the model to the data than RFA. The graphical representations and error tables made it easier to see why XGB is such a strong algorithm for predictive modeling problems.
Conclusions
Our analysis gives information on the potential efficacy of the drugs under examination in treating HIV-1 disease. To predict physicochemical properties, we compared the predictive ability of RFA, linear regression, and XGB in this work. Metrics including MAE, MSE, RMSE, and R2 values were used to assess their effectiveness. With substantially lower error rates and higher R2 values than the other models, XGB performed better. The efficacy of XGB was further demonstrated by graphical representations. The findings have important implications for drug development, particularly in the treatment of HIV. Using machine learning algorithms such as XGB can improve the efficiency of drug property prediction. The superiority of XGB stems from its iterative refinement of predictions. Further techniques and dataset optimization may be investigated in future studies. The research contributes to larger-scale predictive modeling efforts in the pharmaceutical industry. The possibilities of predictive modeling will grow with further development of machine learning techniques. Overall, this work shows that advanced algorithms can be used to improve the drug development process.
Data availability
No datasets were generated or analysed during the current study.
References
Khan MM, Khan MM. Acquired immune deficiency syndrome. In: Immunopharmacology. Cham: Springer; 2016. p. 293–330.
Sellier P, et al. Updated mortality and causes of death in 2020–2021 in people with HIV: a multicenter study in France. AIDS. 2023;37(13):2007–13.
Okoye AA, Picker LJ. CD4+ T-cell depletion in HIV infection: mechanisms of immunological failure. Immunol Rev. 2013;254(1):54–64.
Paiardini M, Müller-Trutwin M. HIV-associated chronic immune activation. Immunol Rev. 2013;254(1):78–101.
Veazey RS. Intestinal CD4 depletion in HIV/SIV infection. Curr Immunol Rev. 2019;15(1):76–91.
Wilson NL, et al. Identifying symptom patterns in people living with HIV disease. J Assoc Nurses AIDS Care. 2016;27(2):121–32.
Joseph SB, et al. HIV-1 target cells in the CNS. J Neurovirol. 2015;21:276–89.
Hu L, et al. Dual-channel hypergraph convolutional network for predicting herb–disease associations. Brief Bioinform. 2024;25(2): bbae067.
Zhao B-W, et al. Motif-aware miRNA-disease association prediction via hierarchical attention network. IEEE J Biomed Health Inform. 2024;28(7):4281–94.
Zhao B-W, et al. iGRLDTI: an improved graph representation learning method for predicting drug–target interactions over heterogeneous biological information network. Bioinformatics. 2023;39(8): btad451.
Zhao B-W, et al. A geometric deep learning framework for drug repositioning over heterogeneous information networks. Brief Bioinform. 2022;23(6): bbac384.
Lv Q, et al. TCMBank: bridges between the largest herbal medicines, chemical ingredients, target proteins, and associated diseases with intelligence text mining. Chem Sci. 2023;14(39):10684–701.
Lv Q, et al. TCMBank-the largest TCM database provides deep learning-based Chinese-Western medicine exclusion prediction. Signal Transduct Target Ther. 2023;8(1):127.
Lv Q, et al. Meta learning with graph attention networks for low-data drug discovery. IEEE Trans Neural Netw Learn Syst. 2023;35(8):11218–30.
Lv Q, et al. Meta-molnet: a cross-domain benchmark for few examples drug discovery. IEEE Trans Neural Netw Learn Syst. 2024. https://doi.org/10.1109/TNNLS.2024.335965.
Lv Q, et al. Mol2Context-vec: learning molecular representation from context awareness for drug discovery. Brief Bioinform. 2021;22(6): bbab317.
Lv Q, et al. 3D graph neural network with few-shot learning for predicting drug–drug interactions in scaffold-based cold start scenario. Neural Netw. 2023;165:94–105.
Ahmed W, et al. A python based algorithmic approach to optimize sulfonamide drugs via mathematical modeling. Sci Rep. 2024;14(1):12264.
Zaman S, et al. On neighborhood eccentricity-based topological indices with QSPR analysis of PAHs drugs. Meas Interdiscip Res Perspect. 2024. https://doi.org/10.1080/15366367.2024.2329950.
Ahmed W, et al. Molecular insights into anti-Alzheimer’s drugs through predictive modeling using linear regression and QSPR analysis. Modern Phys Lett B. 2024. https://doi.org/10.1142/S0217984924502609.
Zaman S, et al. Mathematical modeling and topological graph description of dominating David derived networks based on edge partitions. Sci Rep. 2023;13(1):15159.
Zaman S, et al. Mathematical analysis and molecular descriptors of two novel metal–organic models with chemical applications. Sci Rep. 2023;13(1):5314.
Aqib M, et al. On topological indices of some chemical graphs. Mol Phys. 2023. https://doi.org/10.1080/00268976.2023.2276386.
Bhatia KS, Gupta AK, Saxena AK. Physicochemical significance of topological indices: importance in drug discovery research. Curr Top Med Chem. 2023;23(29):2735–42.
Zanni R, et al. What place does molecular topology have in today’s drug discovery? Expert Opin Drug Discov. 2020;15(10):1133–44.
Ullah A, Bano Z, Zaman S. Computational aspects of two important biochemical networks with respect to some novel molecular descriptors. J Biomol Struct Dyn. 2024;42(2):791–805.
Ullah A, et al. Predictive potential of K-Banhatti and Zagreb type molecular descriptors in structure–property relationship analysis of some novel drug molecules. J Chin Chem Soc. 2024;71(3):250–76.
Zaman S, et al. Three-dimensional structural modelling and characterization of sodalite material network concerning the irregularity topological indices. J Math. 2023;2023(1):5441426.
Zhang X, et al. The study of curve fitting models to analyze some degree-based topological indices of certain anti-cancer treatment. Chem Pap. 2024;78(2):1055–68.
Meharban S, et al. Molecular structural modeling and physical characteristics of anti-breast cancer drugs via some novel topological descriptors and regression models. Curr Res Struct Biol. 2024;7: 100134.
Patel HM, et al. Quantitative structure–activity relationship (QSAR) studies as strategic approach in drug discovery. Med Chem Res. 2014;23:4991–5007.
Zaman S, et al. QSPR analysis of some novel drugs used in blood cancer treatment via degree based topological indices and regression models. Polycycl Aromat Compd. 2023;44:1–17.
Hakeem A, et al. QSPR analysis of some novel drugs used for cardiovascular diseases through degree-based topological indices and regression models. 2023.
Gutman I, Polansky OE. Mathematical concepts in organic chemistry. Berlin: Springer Science & Business Media; 2012.
Fajtlowicz S. On conjectures of Graffiti-II. Congr Numer. 1987;60:187–97.
Furtula B, Gutman I. A forgotten topological index. J Math Chem. 2015;53(4):1184–90.
Zhao W, et al. Computing SS index of certain dendrimers. J Math. 2021;2021:1–14.
Ashraful Alam M, et al. Degree-based entropy for a non-kekulean benzenoid graph. J Math. 2022;2022:1–12.
Gutman I, Furtula B, Katanić V. Randić index and information. AKCE Int J Graphs Comb. 2018;15(3):307–12.
Farahani MR. On the Randic and sum-connectivity index of nanotubes. Ann West Univ Timisoara-Math Comput Sci. 2013;51(2):39–46.
Shirdel GH, Rezapour H, Sayadi AM. The hyper-zagreb index of graph operations. Iran J Math Chem. 2013;4(2):213–20.
Ranjini P, Lokesha V, Usha A. Relation between phenylene and hexagonal squeeze using harmonic index. Int J Graph Theory. 2013;1(4):116–21.
Havare ÖÇ. Topological indices and QSPR modeling of some novel drugs used in the cancer treatment. Int J Quantum Chem. 2021;121(24): e26813.
Kirmani SAK, Ali P, Azam F. Topological indices and QSPR/QSAR analysis of some antiviral drugs being investigated for the treatment of COVID-19 patients. Int J Quantum Chem. 2021;121(9): e26594.
Gnanaraj LRM, Ganesan D, Siddiqui MK. Topological indices and QSPR analysis of NSAID drugs. Polycycl Aromat Compd. 2023;43(10):9479–95.
Huang L, et al. Topological indices and QSPR modeling of new antiviral drugs for cancer treatment. Polycycl Aromat Compd. 2023;43(9):8147–70.
Pence HE, Williams A. ChemSpider: an online chemical information resource. Washington, DC: ACS Publications; 2010.
Kim S, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019;47(D1):D1102–9.
Acknowledgements
The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-94).
Funding
This research was funded by Taif University, Saudi Arabia, Project No. (TU-DSPP-2024-94).
Author information
Contributions
All authors (Wakeel Ahmed, Shahid Zaman, Eizzah Asif, Kashif Ali, Emad E. Mahmoud and Mamo Abebe Asheboss) contributed equally to this manuscript at all stages, from conceptualization to the write-up of the final draft.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
The manuscript has been approved by all authors and consent for publication has been granted.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ahmed, W., Zaman, S., Asif, E. et al. Exploring the role of topological descriptors to predict physicochemical properties of anti-HIV drugs by using supervised machine learning algorithms. BMC Chemistry 18, 167 (2024). https://doi.org/10.1186/s13065-024-01266-4
DOI: https://doi.org/10.1186/s13065-024-01266-4