 Research
 Open access
Exploring the role of topological descriptors to predict physicochemical properties of anti-HIV drugs by using supervised machine learning algorithms
BMC Chemistry volume 18, Article number: 167 (2024)
Abstract
To explore the role of topological indices in predicting the physicochemical properties of anti-HIV drugs, this research uses Python program-based algorithms to compute topological indices together with machine learning algorithms. Degree-based topological indices are calculated using a Python algorithm, providing important information about the structural behavior of drugs that is essential to their anti-HIV effectiveness. Furthermore, machine learning algorithms analyze the physicochemical properties that correspond to anti-HIV activities, making use of their ability to identify complex trends in large, convoluted datasets. In addition to improving our comprehension of the links between molecular structure and effectiveness, the collaboration between machine learning and QSPR research further highlights the potential of computational approaches in drug discovery. This work clarifies the mechanisms underlying anti-HIV effectiveness and demonstrates the advantages of machine learning in assessing drug properties, which paves the way for the design and development of more potent anti-HIV drugs.
Introduction
Human Immunodeficiency Virus (HIV) was first identified in the early 1980s following the appearance of a disease that damages the immune system [1]. The illness was later named Acquired Immunodeficiency Syndrome (AIDS). In 1983–1984, the French scientists Françoise Barré-Sinoussi and Luc Montagnier were instrumental in discovering the virus. HIV has caused a global pandemic that has infected millions of people and killed countless others. Its impact on global health is immense: it threatens human health and also burdens economies and healthcare systems around the globe [2]. There are two primary types of HIV: HIV-1, which is common worldwide, and HIV-2, which is primarily linked to West Africa. Here we focus on HIV-1. HIV-1 targets CD4 cells by binding to their surface receptors, which starts the process of the virus entering the cell and taking control of its functions, as shown in Fig. 1. New viruses are created as a result, and the CD4 cells are ultimately destroyed. This gradual depletion of CD4 cells severely weakens the immune system and sets off a series of immune problems, reducing its capacity to mount effective defenses against infections [1, 3,4,5]. Blood, semen, rectal fluids, vaginal fluids and breast milk are among the bodily fluids that can transmit the virus. These fluids can spread HIV through risky sexual behavior, through needle sharing among people who inject drugs, or from mother to child during pregnancy, birth, or nursing [6]. The goal of antiviral therapy is to stop HIV-1 replication in order to protect CD4 cell levels and immune system health [7]. A wide variety of drugs, including Rilpivirine, Nevirapine, Emtricitabine, Delavirdine, Elvitegravir, Ritonavir, Saquinavir, Indinavir, and Bictegravir (referred to as a, b, c, …, i respectively, as shown in Fig. 2, with their molecular graphs represented in Fig. 3), are used to treat HIV-1 infection. These drugs stop the virus from replicating and from spreading throughout the body by a number of distinct mechanisms. By doing this, they contribute to the regulation of HIV levels in the blood, which protects CD4 cells. In HIV-1 analysis, graph theory provides a fundamental tool, particularly in chemistry and drug development. Embeddings of drugs and diseases through dual-channel networks are characterized in [8,9,10,11]. In addition, the bridges between the largest herbal medicines, chemical ingredients, target proteins, and associated diseases, with respect to neural-network and deep-learning-based invariants, are discussed in [12,13,14,15,16,17].
Graph theory is essential to the analysis of biochemical networks in medicine, including drug-target relationships and protein–protein interactions [18,19,20,21,22]. To aid in the identification of possible drug candidates and the optimization of drug design, graphs depict pharmaceuticals as nodes and their interactions with targets as edges. Furthermore, proteins are shown as nodes in graphs that represent protein–protein interactions as edges. This makes it possible to identify important protein hubs and pathways that are connected to disease causes and potential treatment approaches. Topological indices (TIs) from graph theory are essential for drug discovery [23,24,25].
Our main goal is to conduct an extensive analysis of nine selected antiviral drugs for HIV-1. We develop a Python algorithm based on graph theory to compute their degree-based TIs, such as the Randic, Sum Connectivity, First Zagreb and Second Zagreb indices, which are shown in Table 1. Python programs are essential resources for researchers examining the chemical properties of drugs and computing topological indices. In addition to improving analytical efficiency by automating repetitive processes and quickly processing large data sets, the computational approach offers substantial benefits when many drugs are studied simultaneously. Physicochemical characteristics, namely molecular weight (MW), complexity (Comp), density (Den), flash point (FP), molar volume (MV), surface tension (ST), polarizability (Pol), boiling point (BP) and enthalpy of vaporization (EV), are integrated into the study through machine learning algorithms; by revealing complex links between molecular descriptors and biological activities, this contributes to our understanding of the potential efficacy and safety profiles of drugs against HIV. Combining topological indices with physicochemical parameters is essential to provide a comprehensive understanding of the molecular properties of HIV drugs, as well as insights into their modes of action and potential adverse effects. To predict drug efficacy from molecular features, researchers use supervised machine learning models to establish quantitative correlations between calculated molecular descriptors and observed biological activity.
Supervised machine learning predictive models offer valuable insights into the potential efficacy of anti-HIV drugs by analyzing their molecular properties and estimating their effectiveness against the illness. Quantitative Structure–Property Relationship (QSPR) analysis is becoming increasingly important in understanding the relationships between drug structures and biological behavior [26,27,28,29,30]. QSPR analysis provides a rational framework for drug design and optimization [31,32,33]. By combining computational methods and QSPR analysis, researchers hope to obtain a deeper understanding of the molecular mechanisms underlying anti-HIV drugs, which will help in the development of more focused and efficient treatment options.
Material and method
We first defined the molecular graphs and determined their edge partitions based on graph connectivity, which is an important step in recognizing structural properties. Then, degree-based TIs were calculated by analyzing the variation in node degrees across each molecular graph. To make this process easier, a dedicated Python algorithm was developed. After that, Python programs were used to implement machine learning methods for the analysis of physicochemical properties. Furthermore, we used the Statistical Package for the Social Sciences (SPSS) software to analyze relationships between the computed indices and experimental features, and we performed a graphical comparison between actual and computed drug properties to check the accuracy and credibility of our results.
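The edge-partition step can be illustrated with a small sketch. It assumes the molecular graph is given as a list of bonds between numbered heavy atoms (a hydrogen-suppressed graph); the function name `edge_partition` and the 2-methylbutane toy molecule are illustrative choices and are not taken from the study.

```python
from collections import Counter

def edge_partition(edges):
    """Count edges by the degree pair (r, s), r <= s, of their end vertices."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    partition = Counter()
    for u, v in edges:
        r, s = sorted((degree[u], degree[v]))
        partition[(r, s)] += 1
    return dict(partition)

# Toy example (not one of the nine drugs): hydrogen-suppressed graph of
# 2-methylbutane, with carbon atoms numbered 1..5 and carbon 2 as the branch point.
toy_edges = [(1, 2), (2, 3), (3, 4), (2, 5)]
toy_partition = edge_partition(toy_edges)
print(toy_partition)
```

For this toy graph the partition contains two edges of type (1, 3), one of type (2, 3) and one of type (1, 2); the same bookkeeping applied to a drug's molecular graph yields the E_{r,s} counts from which the degree-based indices are computed.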
Data acquisition and preparation

We used Python 3.12 to compute topological indices and sourced physicochemical properties from the online databases ChemSpider (https://www.chemspider.com) and PubChem (https://pubchem.ncbi.nlm.nih.gov). The topological descriptors were employed as feature variables (input variables), while the physicochemical properties served as target variables. The dataset comprises several feature variables and target variables for each of the nine drugs.

Given that our dataset is labeled, we opted for supervised machine learning algorithms, specifically Random Forest (RF) and XGBoost, to analyze the data and derive insights. RF is chosen for its proficiency in handling overfitting through its ensemble approach: by averaging several decision trees, it yields a more stable and accurate prediction and reduces the risk of overfitting that is common with single decision trees. XGBoost is based on the gradient boosting framework, which builds one tree at a time; each new tree helps to correct errors made by the previously trained trees.

The primary libraries utilized for Random Forest and XGBoost are:

“pandas” for data manipulation,

“numpy” for numerical operations,

“scikit-learn” for machine learning algorithms, including Random Forest (XGBoost itself is distributed as the separate “xgboost” package, which offers a scikit-learn-compatible interface),

“matplotlib” and “seaborn” for data visualization.

Computational resources: the computations were performed on a machine with an Intel Core i7 processor and 16 GB of RAM.

Results and discussion
Theorem 1
Let G_{1} denote the molecular graph of elvitegravir. Then the following hold for G_{1}:
(a) M_{1} (G_{1}) = 162; (b) M_{2} (G_{1}) = 195; (c) H (G_{1}) = 13.966; (d) F (G_{1}) = 432; (e) SS (G_{1}) = 35.088; (f) ABC (G_{1}) = 23.695; (g) RI (G_{1}) = 14.688; (h) SC (G_{1}) = 15.1037; (i) GA (G_{1}) = 31.705; (j) HZ (G_{1}) = 822; (k) ReZ_{1} (G_{1}) = 37.983; (l) ReZ_{2} (G_{1}) = 1028.
Proof
Suppose that elvitegravir is represented by G_{1}, and let E_{r,s} denote the set of edges joining vertices of degrees r and s; the same symbol is used for the number of such edges. For example, E_{1,2} = 2 means that two edges join vertices of degrees 1 and 2, while E_{1,3} = 7 means that seven edges join vertices of degrees 1 and 3. Similarly, E_{2,2} = 2, E_{2,3} = 12 and E_{3,3} = 10. In the formulas below, the sums run over the edges rs of the graph, and d_r and d_s denote the degrees of the end vertices of rs. Then,

a)
By using First Zagreb Index
$$M_{1}(G) = \sum_{rs \in E(G)} \left( d_{r} + d_{s} \right),$$
$$\begin{aligned} M_{1}(G_{1}) & = 2(1 + 2) + 7(1 + 3) + 2(2 + 2) + 12(2 + 3) + 10(3 + 3) \\ & = 2 \times 3 + 7 \times 4 + 2 \times 4 + 12 \times 5 + 10 \times 6 = 162. \end{aligned}$$
b)
By using Second Zagreb Index
$$M_{2}(G) = \sum_{rs \in E(G)} \left( d_{r} \times d_{s} \right),$$
$$\begin{aligned} M_{2}(G_{1}) & = 2(1 \times 2) + 7(1 \times 3) + 2(2 \times 2) + 12(2 \times 3) + 10(3 \times 3) \\ & = 2 \times 2 + 7 \times 3 + 2 \times 4 + 12 \times 6 + 10 \times 9 = 195. \end{aligned}$$
c)
By using Harmonic Index
$$H(G) = \sum_{rs \in E(G)} \frac{2}{d_{r} + d_{s}},$$
$$\begin{aligned} H(G_{1}) & = 2 \times \frac{2}{1 + 2} + 7 \times \frac{2}{1 + 3} + 2 \times \frac{2}{2 + 2} + 12 \times \frac{2}{2 + 3} + 10 \times \frac{2}{3 + 3} \\ & = 2 \times \frac{2}{3} + 7 \times \frac{2}{4} + 2 \times \frac{2}{4} + 12 \times \frac{2}{5} + 10 \times \frac{2}{6} \approx 13.966. \end{aligned}$$
d)
By using Forgotten Index
$$F(G) = \sum_{rs \in E(G)} \left[ d_{r}^{2} + d_{s}^{2} \right],$$
$$\begin{aligned} F(G_{1}) & = 2(1^{2} + 2^{2}) + 7(1^{2} + 3^{2}) + 2(2^{2} + 2^{2}) + 12(2^{2} + 3^{2}) + 10(3^{2} + 3^{2}) \\ & = 2 \times 5 + 7 \times 10 + 2 \times 8 + 12 \times 13 + 10 \times 18 = 432. \end{aligned}$$
e)
By using ShilpaShanmukha Index
$$SS(G) = \sum_{rs \in E(G)} \sqrt{\frac{d_{r} \times d_{s}}{d_{r} + d_{s}}},$$
$$\begin{aligned} SS(G_{1}) & = 2\sqrt{\frac{1 \times 2}{1 + 2}} + 7\sqrt{\frac{1 \times 3}{1 + 3}} + 2\sqrt{\frac{2 \times 2}{2 + 2}} + 12\sqrt{\frac{2 \times 3}{2 + 3}} + 10\sqrt{\frac{3 \times 3}{3 + 3}} \\ & = 2\sqrt{\frac{2}{3}} + 7\sqrt{\frac{3}{4}} + 2\sqrt{\frac{4}{4}} + 12\sqrt{\frac{6}{5}} + 10\sqrt{\frac{9}{6}} = 35.088. \end{aligned}$$
f)
By using Randic Index
$$RI(G) = \sum_{rs \in E(G)} \sqrt{\frac{1}{d_{r} \times d_{s}}},$$
$$\begin{aligned} RI(G_{1}) & = 2\sqrt{\frac{1}{1 \times 2}} + 7\sqrt{\frac{1}{1 \times 3}} + 2\sqrt{\frac{1}{2 \times 2}} + 12\sqrt{\frac{1}{2 \times 3}} + 10\sqrt{\frac{1}{3 \times 3}} \\ & = 2\sqrt{\frac{1}{2}} + 7\sqrt{\frac{1}{3}} + 2\sqrt{\frac{1}{4}} + 12\sqrt{\frac{1}{6}} + 10\sqrt{\frac{1}{9}} = 14.688. \end{aligned}$$
g)
By using Sum Connectivity Index

$$SC(G) = \sum_{rs \in E(G)} \sqrt{\frac{1}{d_{r} + d_{s}}},$$
$$\begin{aligned} SC(G_{1}) & = 2\sqrt{\frac{1}{1 + 2}} + 7\sqrt{\frac{1}{1 + 3}} + 2\sqrt{\frac{1}{2 + 2}} + 12\sqrt{\frac{1}{2 + 3}} + 10\sqrt{\frac{1}{3 + 3}} \\ & = 2\sqrt{\frac{1}{3}} + 7\sqrt{\frac{1}{4}} + 2\sqrt{\frac{1}{4}} + 12\sqrt{\frac{1}{5}} + 10\sqrt{\frac{1}{6}} = 15.1037. \end{aligned}$$


h)
By using Geometric Arithmetic Index

$$GA(G) = \sum_{rs \in E(G)} \frac{2\sqrt{d_{r} \times d_{s}}}{d_{r} + d_{s}},$$
$$\begin{aligned} GA(G_{1}) & = 2 \times 2\frac{\sqrt{1 \times 2}}{1 + 2} + 2 \times 7\frac{\sqrt{1 \times 3}}{1 + 3} + 2 \times 2\frac{\sqrt{2 \times 2}}{2 + 2} + 2 \times 12\frac{\sqrt{2 \times 3}}{2 + 3} + 2 \times 10\frac{\sqrt{3 \times 3}}{3 + 3} \\ & = 4\frac{\sqrt{2}}{3} + 14\frac{\sqrt{3}}{4} + 4\frac{\sqrt{4}}{4} + 24\frac{\sqrt{6}}{5} + 20\frac{\sqrt{9}}{6} = 31.7053. \end{aligned}$$


i)
By using Hyper Zagreb Index

$$HZ(G) = \sum_{rs \in E(G)} \left( d_{r} + d_{s} \right)^{2},$$
$$\begin{aligned} HZ(G_{1}) & = 2(1 + 2)^{2} + 7(1 + 3)^{2} + 2(2 + 2)^{2} + 12(2 + 3)^{2} + 10(3 + 3)^{2} \\ & = 2(3)^{2} + 7(4)^{2} + 2(4)^{2} + 12(5)^{2} + 10(6)^{2} = 822. \end{aligned}$$


j)
By using Redefined First Zagreb Index
$$ReZ_{1}(G) = \sum_{rs \in E(G)} \frac{d_{r} \times d_{s}}{d_{r} + d_{s}},$$
$$\begin{aligned} ReZ_{1}(G_{1}) & = 2 \times \frac{1 \times 2}{1 + 2} + 7 \times \frac{1 \times 3}{1 + 3} + 2 \times \frac{2 \times 2}{2 + 2} + 12 \times \frac{2 \times 3}{2 + 3} + 10 \times \frac{3 \times 3}{3 + 3} \\ & = 2 \times \frac{2}{3} + 7 \times \frac{3}{4} + 2 \times \frac{4}{4} + 12 \times \frac{6}{5} + 10 \times \frac{9}{6} = 37.9833. \end{aligned}$$
k)
By using Redefined Second Zagreb Index

$$ReZ_{2}(G) = \sum_{rs \in E(G)} \left( d_{r} \times d_{s} \right)\left( d_{r} + d_{s} \right),$$
$$\begin{aligned} ReZ_{2}(G_{1}) & = 2(1 \times 2)(1 + 2) + 7(1 \times 3)(1 + 3) + 2(2 \times 2)(2 + 2) + 12(2 \times 3)(2 + 3) + 10(3 \times 3)(3 + 3) \\ & = 2 \times 2 \times 3 + 7 \times 3 \times 4 + 2 \times 4 \times 4 + 12 \times 6 \times 5 + 10 \times 9 \times 6 = 1028. \end{aligned}$$
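Theorem 1 also lists ABC (G_{1}) = 23.695 in item (f), but no derivation for it appears in the proof. Assuming ABC denotes the atom-bond connectivity index in its usual form (the definition below is an assumption, since the text does not state it), the same edge partition gives:

$$ABC(G) = \sum_{rs \in E(G)} \sqrt{\frac{d_{r} + d_{s} - 2}{d_{r} \times d_{s}}},$$
$$\begin{aligned} ABC(G_{1}) & = 2\sqrt{\frac{1 + 2 - 2}{1 \times 2}} + 7\sqrt{\frac{1 + 3 - 2}{1 \times 3}} + 2\sqrt{\frac{2 + 2 - 2}{2 \times 2}} + 12\sqrt{\frac{2 + 3 - 2}{2 \times 3}} + 10\sqrt{\frac{3 + 3 - 2}{3 \times 3}} \\ & = 2\sqrt{\frac{1}{2}} + 7\sqrt{\frac{2}{3}} + 2\sqrt{\frac{2}{4}} + 12\sqrt{\frac{3}{6}} + 10\sqrt{\frac{4}{9}} \approx 23.6959. \end{aligned}$$

This agrees with the value stated in Theorem 1 up to truncation of the last digit, which suggests that the usual definition is the one intended.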

Remark 3.2
The topological indices of other drugs can be obtained using a similar technique as that used in Theorem 1 and their output is provided in Table 2.
Although many researchers have already computed topological indices [43,44,45,46], we contribute an efficient Python program (see Algorithm 1) to compute these indices. In particular, our technique computes them quickly by taking the edge partition values of each molecular graph as input. This Python method offers a simplified procedure, reduces the risk of manual arithmetic errors, and saves time when computing topological indices.
Theorem 1 and Algorithm 1 can both be used to compute topological indices; however, the algorithmic approach is more efficient. Moreover, Table 3 shows the physicochemical properties of the selected drugs, collected from ChemSpider [47] and PubChem [48], together with the TIs computed from their molecular structures by the Python algorithm described above.
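Algorithm 1 itself is not reproduced in this section. As a rough illustration of the idea, the sketch below computes the degree-based indices of Theorem 1 directly from an edge partition. The names `INDEX_FORMULAS` and `topological_indices` are hypothetical, and the ABC entry assumes the usual atom-bond connectivity definition, which the text does not spell out.

```python
import math

# Contribution of a single edge to each index, as a function of the
# degrees r and s of its end vertices.
INDEX_FORMULAS = {
    "M1":   lambda r, s: r + s,
    "M2":   lambda r, s: r * s,
    "H":    lambda r, s: 2 / (r + s),
    "F":    lambda r, s: r**2 + s**2,
    "SS":   lambda r, s: math.sqrt(r * s / (r + s)),
    "ABC":  lambda r, s: math.sqrt((r + s - 2) / (r * s)),  # assumed definition
    "RI":   lambda r, s: 1 / math.sqrt(r * s),
    "SC":   lambda r, s: 1 / math.sqrt(r + s),
    "GA":   lambda r, s: 2 * math.sqrt(r * s) / (r + s),
    "HZ":   lambda r, s: (r + s) ** 2,
    "ReZ1": lambda r, s: r * s / (r + s),
    "ReZ2": lambda r, s: r * s * (r + s),
}

def topological_indices(partition):
    """Sum each edge formula over the partition {(r, s): number of edges}."""
    return {
        name: sum(count * formula(r, s) for (r, s), count in partition.items())
        for name, formula in INDEX_FORMULAS.items()
    }

# Edge partition of elvitegravir as given in the proof of Theorem 1.
ELVITEGRAVIR = {(1, 2): 2, (1, 3): 7, (2, 2): 2, (2, 3): 12, (3, 3): 10}

for name, value in topological_indices(ELVITEGRAVIR).items():
    print(f"{name:5s} {value:.4f}")
```

Run on the elvitegravir partition, this reproduces the hand calculations in the proof, for example M_1 = 162, M_2 = 195 and GA ≈ 31.7053. Processing another drug only requires swapping in its edge partition.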
Supervised machine learning
Within the field of artificial intelligence, machine learning focuses on creating statistical models and algorithms that allow computers to learn and make decisions without explicit programming. Drug development commonly involves machine learning techniques such as the Random Forest Algorithm (RFA), Extreme Gradient Boosting (XGB), and linear analysis. Linear analysis techniques such as linear regression are helpful for simpler, easier-to-understand models, whereas ensemble learning techniques such as XGB and RFA can handle complex nonlinear correlations and interactions in data.
Random forest
For machine learning tasks including regression, RFA is a potent ensemble learning technique. During training, it builds a large number of decision trees, and for regression it outputs the mean of the predictions of the individual trees. To begin, RF uses bootstrapping, a technique that draws many random samples, with replacement, from the training set. Each such subset is referred to as a bootstrap sample, and one decision tree is trained on each. At every split point of a tree, only a random subset of features is considered. This randomness helps decorrelate the trees, which improves overall model performance. Each tree is grown to its full depth without pruning. Once all trees are constructed, the Random Forest algorithm combines their predictions. The prediction formula for regression is:
$$Y^{\prime} = \frac{1}{n}\sum_{i = 1}^{n} y_{i},$$
where Y′ is the predicted output, y_{1}, y_{2}, …, y_{n} are the predicted outputs from the individual decision trees, and n is the total number of trees in the Random Forest. Figure 4 represents the feature importance of some physicochemical properties w.r.t. topological indices; Figs. 5 and 6 illustrate the decision trees.
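The bootstrap-and-average idea described above can be sketched in plain Python. This is a deliberately simplified stand-in and not the study's implementation: each "tree" is a depth-1 regression stump on one randomly chosen feature, whereas a real random forest grows full-depth trees and samples features at every split. The toy data and all function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(X, y, feature):
    """Depth-1 regression tree: the best single threshold on one feature."""
    best = None
    values = sorted({row[feature] for row in X})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                 # feature is constant in this sample
        m = mean(y)
        return lambda row: m
    _, t, ml, mr = best
    return lambda row: ml if row[feature] <= t else mr

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample (with replacement)
        feature = rng.randrange(len(X[0]))         # random feature for this stump
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return trees

def predict(trees, row):
    """Y' = (1/n) * sum of the individual tree predictions."""
    return mean([tree(row) for tree in trees])

# Hypothetical toy data: two features per sample, one target value.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
trees = fit_forest(X, y)
print(predict(trees, X[0]))
```

Because every stump predicts a mean of training targets, the averaged prediction always lies between the smallest and largest target value, which is one reason tree ensembles cannot extrapolate beyond the training range.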
Violin plots highlight gaps in the data distribution and help evaluate the accuracy of predictions against actual values graphically as shown in Figs. 7, 8 and 9. RFA output error measures are shown in Table 4 and include specific parameters like Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The following formulas can be used to determine MAE, MSE, and RMSE:

$$\text{MAE} = \frac{1}{n}\sum_{i = 1}^{n} \left| \text{actual}_{i} - \text{predicted}_{i} \right|,$$

$$\text{MSE} = \frac{1}{n}\sum_{i = 1}^{n} \left( \text{actual}_{i} - \text{predicted}_{i} \right)^{2},$$

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i = 1}^{n} \left( \text{actual}_{i} - \text{predicted}_{i} \right)^{2}}.$$
The random forest algorithm’s performance and prediction accuracy were examined through information gained from both the violin plots and tables.
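As a small illustration of the three error measures, the sketch below evaluates them on hypothetical numbers (not values from Table 4). RMSE is taken here as the square root of MSE, which is the standard definition.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

Since MSE and RMSE square the deviations, a single large miss (the error of 2 in the last sample) weighs more heavily in them than in MAE.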
Linear regression
Linear regression is a fundamental supervised machine learning technique that models the relationship between a dependent variable and one or more independent variables. These models quantify the relationship between drug structures and their properties through descriptors such as TIs. The QSPR results are expressed through the regression equation P = X + Y (TI), where P is the physicochemical parameter of a potential drug, TI is the topological index, X is the constant (intercept) and Y is the regression coefficient. The correlation coefficients between each topological index and the nine physicochemical parameters are shown in Table 5, while a bar graph of the correlation coefficients of all physicochemical properties across the different topological indices is shown in Fig. 10. The linear regression equations for the physicochemical properties w.r.t. the TIs are given below.
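To make the form P = X + Y (TI) concrete, here is a minimal ordinary least squares sketch for a single topological index. The study reports using SPSS for this step; the function name `fit_line` and the data points are hypothetical and chosen only so that the result is easy to check by hand.

```python
def fit_line(ti, p):
    """Least squares estimates (X, Y) for the model P = X + Y * TI."""
    n = len(ti)
    mean_t = sum(ti) / n
    mean_p = sum(p) / n
    sxy = sum((t - mean_t) * (q - mean_p) for t, q in zip(ti, p))
    sxx = sum((t - mean_t) ** 2 for t in ti)
    slope = sxy / sxx                      # regression coefficient Y
    intercept = mean_p - slope * mean_t    # constant X
    return intercept, slope

# Hypothetical data lying exactly on P = 5 + 2 * TI.
toy_ti = [10.0, 20.0, 30.0, 40.0]
toy_p = [25.0, 45.0, 65.0, 85.0]
X_const, Y_coef = fit_line(toy_ti, toy_p)
print(X_const, Y_coef)
```

With real data the points scatter around the line, and the correlation coefficient indicates how well a given index explains the property.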
Linear regression models w.r.t TIs
Regression model for M_{2}(G)  Regression model for M_{1}(G) 
Molecular weight = 22.1100 + 2.1165[M_{2}(G)]  Molecular weight = 20.3377 + 2.4977[M_{1}(G)] 
Complexity = 0.2808 + 3.5990[M_{2}(G)]  Complexity = 16.339 + 4.1416[M_{1}(G)] 
Density = 1.7482 − 0.0016[M_{2}(G)]  Density = 1.7593 − 0.0020[M_{1}(G)] 
Flash point = 46.9325 + 1.5793[M_{2}(G)]  Flash point = 45.2833 + 1.8655[M_{1}(G)] 
Molar volume = − 45.305 + 1.8798[M_{2}(G)]  Molar volume = − 50.941 + 2.2409[M_{1}(G)] 
Surface tension = 79.4937 − 0.0705[M_{2}(G)]  Surface tension = 79.4230 − 0.0825[M_{1}(G)] 
Polarizability = − 1.6054 + 0.2440[M_{2}(G)]  Polarizability = − 2.3993 + 0.2913[M_{1}(G)] 
Boiling point = 154.073 + 2.6109[M_{2}(G)]  Boiling point = 151.346 + 3.0842[M_{1}(G)] 
Enthalpy of vaporization = 32.0454 + 0.3643[M_{2}(G)]  Enthalpy of vaporization = 31.5778 + 0.4309[M_{1}(G)] 
Regression model for F(G)  Regression model for H(G) 
Molecular weight = 56.1270 + 0.8636[F(G)]  Molecular weight = 14.0675 + 29.5447[H(G)] 
Complexity = 42.5485 + 1.5009[F(G)]  Complexity = 50.3265 + 46.1198[H(G)] 
Density = 1.7062 − 0.0006[F(G)]  Density = 1.79322 − 0.0254[H(G)] 
Flash point = 68.2462 + 0.6529[F(G)]  Flash point = 44.9218 + 21.7875[H(G)] 
Molar volume = − 11.6411 + 0.7598[F(G)]  Molar volume = − 66.0938 + 27.1224[H(G)] 
Surface tension = 77.8286 − 0.0277[F(G)]  Surface tension = 80.3293 − 1.0205[H(G)] 
Polarizability = 2.6997 + 0.0987[F(G)]  Polarizability = − 4.2840 + 3.5197[H(G)] 
Boiling point = 189.3117 + 1.0793[F(G)]  Boiling point = 150.7451 + 36.0198[H(G)] 
Enthalpy of vaporization = 36.8359 + 0.1509[F(G)]  Enthalpy of vaporization = 31.3317 + 5.0423[H(G)] 
Regression model for ABC  Regression model for SS(G) 
Molecular weight = 20.3827 + 16.9418[ABC(G)]  Molecular weight = 9.7915 + 11.7757[SS(G)] 
Complexity = 37.6722 + 27.2928[ABC(G)]  Complexity = 16.8938 + 19.0652[SS(G)] 
Density = 1.7710 − 0.0140[ABC(G)]  Density = 1.7819 − 0.0098[SS(G)] 
Flash point = 46.5920 + 12.6059[ABC(G)]  Flash point = 39.4157 + 8.7440[SS(G)] 
Molar volume = − 55.2781 + 15.3641[ABC(G)]  Molar volume = − 64.516 + 10.6698[SS(G)] 
Surface tension = 79.6061 − 0.5662[ABC(G)]  Surface tension = 80.0087 − 0.3948[SS(G)] 
Polarizability = − 2.9516 + 1.9965[ABC(G)]  Polarizability = − 4.1564 + 1.3866[SS(G)] 
Boiling point = 153.5087 + 20.8404[ABC(G)]  Boiling point = 141.6433 + 14.4558[SS(G)] 
Enthalpy of vaporization = 31.769 + 2.915[ABC(G)]  Enthalpy of vaporization = 30.2478 + 2.0188[SS(G)] 
Regression model for SC  Regression model for RI 
Molecular weight = 13.3437 + 27.1882[SC(G)]  Molecular weight = 16.2195 + 28.1100[RI(G)] 
Complexity = 39.4253 + 43.0217[SC(G)]  Complexity = 47.5607 + 44.2587[RI(G)] 
Density = 1.7876 − 0.0230[SC(G)]  Density = 1.7843 − 0.0238[RI(G)] 
Flash point = 43.2542 + 20.1170[SC(G)]  Flash point = 45.7931 + 20.7737[RI(G)] 
Molar volume = − 64.6865 + 24.836[SC(G)]  Molar volume = − 62.3854 + 25.6982[RI(G)] 
Surface tension = 80.1367 − 0.9262[SC(G)]  Surface tension = 80.1996 − 0.9676[RI(G)] 
Polarizability = − 4.1379 + 3.2252[SC(G)]  Polarizability = − 3.7882 + 3.3340[RI(G)] 
Boiling point = 147.989 + 33.2581[SC(G)]  Boiling point = 152.1868 + 34.3436[RI(G)] 
Enthalpy of vaporization = 30.9983 + 4.6526[SC(G)]  Enthalpy of vaporization = 31.4694 + 4.8117[RI(G)] 
Regression model for HZ  Regression model for GA 
Molecular weight = 123.0392 + 0.3994[HZ(G)]  Molecular weight = 9.4468 + 13.0085[GA(G)] 
Complexity = 170.448 + 0.6808[HZ(G)]  Complexity = 28.0960 + 20.7296[GA(G)] 
Density = 1.6200 − 0.0003[HZ(G)]  Density = 1.7901 − 0.0110[GA(G)] 
Flash point = 128.5374 + 0.2908[HZ(G)]  Flash point = 40.1194 + 9.6328[GA(G)] 
Molar volume = 52.1288 + 0.3458[HZ(G)]  Molar volume = − 67.435 + 11.8602[GA(G)] 
Surface tension = 76.9305 − 0.0142[HZ(G)]  Surface tension = 80.1620 − 0.4401[GA(G)] 
Polarizability = 11.1992 + 0.0447[HZ(G)]  Polarizability = − 4.5259 + 1.5410[GA(G)] 
Boiling point = 288.9813 + 0.4807[HZ(G)]  Boiling point = 142.8059 + 15.9243[GA(G)] 
Enthalpy of vaporization = 49.5430 + 0.0686[HZ(G)]  Enthalpy of vaporization = 30.377 + 2.2248[GA(G)] 
Regression model for ReZ_{2}  Regression model for ReZ_{1} 
Molecular weight = 57.3870 + 0.3780[ReZ_{2}(G)]  Molecular weight = 6.6707 + 11.0317[ReZ_{1}(G)] 
Complexity = 29.1286 + 0.6712[ReZ_{2}(G)]  Complexity = 3.7703 + 18.0524[ReZ_{1}(G)] 
Density = 1.6995 − 0.0003[ReZ_{2}(G)]  Density = 1.7808 − 0.0091[ReZ_{1}(G)] 
Flash point = 71.4891 + 0.2836[ReZ_{2}(G)]  Flash point = 37.1349 + 8.1906[ReZ_{1}(G)] 
Molar volume = − 7.2151 + 0.3295[ReZ_{2}(G)]  Molar volume = − 65.6712 + 9.9559[ReZ_{1}(G)] 
Surface tension = 78.3134 − 0.0126[ReZ_{2}(G)]  Surface tension = 80.1156 − 0.3699[ReZ_{1}(G)] 
Polarizability = 3.4246 + 0.0427[ReZ_{2}(G)]  Polarizability = − 4.2984 + 1.2936[ReZ_{1}(G)] 
Boiling point = 194.6729 + 0.4689[ReZ_{2}(G)]  Boiling point = 137.8725 + 13.5410[ReZ_{1}(G)] 
Enthalpy of vaporization = 37.6498 + 0.0655[ReZ_{2}(G)]  Enthalpy of vaporization = 29.7872 + 1.8895[ReZ_{1}(G)] 
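The regression models listed above can be evaluated directly once the indices of a drug are known. The sketch below copies three of the equations and applies them to the elvitegravir values M_1 = 162 and M_2 = 195 from Theorem 1; the dictionary layout and the function name `predict_property` are illustrative choices only.

```python
# (constant X, coefficient Y) of P = X + Y * TI, copied from the models above.
MODELS = {
    ("Molecular weight", "M1"): (20.3377, 2.4977),
    ("Molecular weight", "M2"): (22.1100, 2.1165),
    ("Boiling point", "M1"): (151.346, 3.0842),
}

# Index values of elvitegravir from Theorem 1.
ELVITEGRAVIR_TI = {"M1": 162, "M2": 195}

def predict_property(prop, index, ti_values):
    """Evaluate the regression model for one property and one index."""
    constant, coefficient = MODELS[(prop, index)]
    return constant + coefficient * ti_values[index]

for prop, index in MODELS:
    value = predict_property(prop, index, ELVITEGRAVIR_TI)
    print(f"{prop} from {index}: {value:.4f}")
```

For instance, the M_1 model gives a molecular weight estimate of 20.3377 + 2.4977 × 162 = 424.9651, while the M_2 model gives 434.8275. Different indices therefore yield somewhat different estimates for the same drug, which is why the fit statistics of each model matter.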
Computation of statistical parameters
Statistical parameters are useful in model analysis for comparing topological indices (TIs) alongside their correlation coefficients. In a regression model, the standard error (SE) of the estimate measures the typical deviation of predicted outcomes from actual values. Tables 6, 7 and 8 show the SE, F-statistics and significance p-values. Furthermore, the comparison graphs in Figs. 11, 12, 13, 14, 15, 16, 17, 18 and 19 show both the actual and the regression-derived physicochemical property values.
Additionally, the majority of p-values are less than 0.05, and r mostly exceeds 0.6, as seen in Table 4.
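For a single-predictor model, these statistics follow from a few sums of squares. The sketch below computes the correlation coefficient r, the standard error of the estimate and the F-statistic on hypothetical data; it is not the SPSS procedure used in the study. The p-value is omitted because it additionally needs the F-distribution with (1, n − 2) degrees of freedom, which a statistics library (for example `scipy.stats.f.sf`) would supply.

```python
import math

def regression_stats(x, y):
    """r, standard error of the estimate and F for the model y = a + b * x.

    Assumes the fit is not perfect (a zero residual sum of squares would
    make F undefined).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    ss_reg = slope * sxy      # variation explained by the regression line
    ss_res = syy - ss_reg     # residual (unexplained) variation
    return {
        "r": sxy / math.sqrt(sxx * syy),
        "se": math.sqrt(ss_res / (n - 2)),
        "f": ss_reg / (ss_res / (n - 2)),
    }

# Hypothetical index values and property values, for illustration only.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
stats = regression_stats(x_demo, y_demo)
print(stats)
```

In the single-predictor case F and r carry the same information, since F = r²(n − 2)/(1 − r²); with only nine drugs, n − 2 = 7, so a high r is needed before the corresponding p-value drops below 0.05.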
Extreme gradient boosting
Extreme Gradient Boosting (XGB) is a powerful machine learning method that is well known for its efficiency in predictive modeling. The pseudocode in Algorithm 2 provides useful information about XGB, including its flexibility and adaptability. The distribution plots of the actual and predicted values are shown in Figs. 20, 21 and 22, which are essential for evaluating the effectiveness of the model and detecting any deviations. The violin plots further aid understanding by displaying the data distribution graphically while highlighting the peculiarities specific to XGB. Table 9 reports the error estimates, which supports a comprehensive review of the model's predictive power and general accuracy. When using the XGB algorithm, a well-organized overview of the implementation procedure, like the one provided by the pseudocode, helps expedite the process and improves understanding of its complexities.
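Algorithm 2 is not reproduced in this section. The sketch below shows only the core boosting idea, namely that each new tree is fitted to the errors left by the trees before it. It is plain gradient boosting with squared loss on a single feature using depth-1 stumps, not XGBoost itself, which adds regularization, second-order gradient information and many engineering optimizations. All names and data are hypothetical.

```python
from statistics import mean

def fit_stump(x, residual):
    """Depth-1 regression tree on one feature, fitted to the residuals.

    Assumes x contains at least two distinct values.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        ml, mr = mean(left), mean(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def fit_boosting(x, y, n_rounds=20, learning_rate=0.3):
    base = mean(y)                      # initial constant prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the plain residual.
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Hypothetical toy data.
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_toy = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]
model = fit_boosting(x_toy, y_toy)
print([round(model(xi), 3) for xi in x_toy])
```

The learning rate shrinks each correction so that no single tree dominates; smaller values usually need more rounds but tend to generalize better.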
Physicochemical parameters comparison analysis
When XGB and RFA were used to forecast the physicochemical properties of anti-HIV medicines, the XGB predictions were consistently higher than those of RFA. This implies that, for the physicochemical characteristics of anti-HIV drugs, the XGB algorithm typically yields more optimistic forecasts.
Even though these two machine learning models provide insightful information about the structure–activity relationship of the associated drugs, the difference in predicted values emphasizes how crucial it is to consider a variety of computational strategies and validation methods in order to ensure the precision and dependability of predictions made during the drug discovery and development process. Tables 10 and 11 list the experimental and predicted data of RFA and XGB for the physical properties, and Figs. 23, 24, 25, 26 and 27 show the graphical comparison between XGB and RFA.
Standard error measures such as MAE, MSE, and RMSE are used to evaluate the performance of predictive models such as RFA and XGB. To evaluate the relative efficiency of the models and compare the error indicators, tables and graphs were used. In terms of prediction accuracy, XGB performed better than RFA, as seen by lower MAE, MSE, and RMSE values. Furthermore, higher R^{2} values for XGB than for RFA demonstrated a better fit of the model to the data. The graphical representations and error tables make it easier to see why XGB is such a strong algorithm for predictive modeling problems.
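The R^{2} comparison can be sketched as follows. The two prediction lists are hypothetical stand-ins for the outputs of two models and are not values from Tables 10 and 11; `r_squared` is an illustrative name.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical values, for illustration only.
actual_demo = [2.0, 4.0, 6.0, 8.0]
model_a_pred = [2.5, 3.5, 6.5, 7.5]   # smaller deviations
model_b_pred = [3.0, 5.0, 5.0, 7.0]   # larger deviations

print(r_squared(actual_demo, model_a_pred), r_squared(actual_demo, model_b_pred))
```

The model with the smaller residuals obtains the higher R^{2} (0.95 against 0.80 here), mirroring the way lower MAE, MSE and RMSE go together with a better fit.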
Conclusions
The conclusion of our analysis gives information on the potential efficacy of the drugs under examination in treating HIV-1 disease. To predict physicochemical properties, we compared the forecasting ability of RFA, linear regression, and XGB in this work. Metrics including MAE, MSE, RMSE, and R^{2} were used to assess their effectiveness. With substantially lower error rates and higher R^{2} values than the other models, XGB performed best. The efficacy of XGB was further demonstrated by graphical representations. Particularly in the treatment of HIV, the findings have important implications for drug development. Using machine learning algorithms such as XGB can improve the efficiency of drug property prediction. The superiority of XGB derives from its iterative refinement of predictions. Additional techniques and dataset optimization may be investigated in future studies. The research contributes to larger-scale predictive modeling efforts in the pharmaceutical industry. The possibilities of predictive modeling will grow with further development of machine learning techniques. Overall, this work shows that advanced algorithms can be used to improve the drug development process.
Data availability
No datasets were generated or analysed during the current study.
References
Khan MM, Khan MM. Acquired immune deficiency syndrome. In: Immunopharmacology. Cham: Springer; 2016. p. 293–330.
Sellier P, et al. Updated mortality and causes of death in 2020–2021 in people with HIV: a multicenter study in France. AIDS. 2023;37(13):2007–13.
Okoye AA, Picker LJ. CD4+ T-cell depletion in HIV infection: mechanisms of immunological failure. Immunol Rev. 2013;254(1):54–64.
Paiardini M, Müller-Trutwin M. HIV-associated chronic immune activation. Immunol Rev. 2013;254(1):78–101.
Veazey RS. Intestinal CD4 depletion in HIV/SIV infection. Curr Immunol Rev. 2019;15(1):76–91.
Wilson NL, et al. Identifying symptom patterns in people living with HIV disease. J Assoc Nurses AIDS Care. 2016;27(2):121–32.
Joseph SB, et al. HIV-1 target cells in the CNS. J Neurovirol. 2015;21:276–89.
Hu L, et al. Dual-channel hypergraph convolutional network for predicting herb–disease associations. Brief Bioinform. 2024;25(2):bbae067.
Zhao BW, et al. Motif-aware miRNA-disease association prediction via hierarchical attention network. IEEE J Biomed Health Inform. 2024;28(7):4281–94.
Zhao BW, et al. iGRLDTI: an improved graph representation learning method for predicting drug–target interactions over heterogeneous biological information network. Bioinformatics. 2023;39(8): btad451.
Zhao BW, et al. A geometric deep learning framework for drug repositioning over heterogeneous information networks. Brief Bioinform. 2022;23(6): bbac384.
Lv Q, et al. TCMBank: bridges between the largest herbal medicines, chemical ingredients, target proteins, and associated diseases with intelligence text mining. Chem Sci. 2023;14(39):10684–701.
Lv Q, et al. TCMBank-the largest TCM database provides deep learning-based Chinese-Western medicine exclusion prediction. Signal Transduct Target Ther. 2023;8(1):127.
Lv Q, et al. Meta learning with graph attention networks for low-data drug discovery. IEEE Trans Neural Netw Learn Syst. 2023;35(8):11218–30.
Lv Q, et al. Meta-MolNet: a cross-domain benchmark for few examples drug discovery. IEEE Trans Neural Netw Learn Syst. 2024. https://doi.org/10.1109/TNNLS.2024.335965.
Lv Q, et al. Mol2Contextvec: learning molecular representation from context awareness for drug discovery. Brief Bioinform. 2021;22(6): bbab317.
Lv Q, et al. 3D graph neural network with few-shot learning for predicting drug–drug interactions in scaffold-based cold start scenario. Neural Netw. 2023;165:94–105.
Ahmed W, et al. A python based algorithmic approach to optimize sulfonamide drugs via mathematical modeling. Sci Rep. 2024;14(1):12264.
Zaman S, et al. On neighborhood eccentricity-based topological indices with QSPR analysis of PAHs drugs. Meas Interdiscip Res Perspect. 2024. https://doi.org/10.1080/15366367.2024.2329950.
Ahmed W, et al. Molecular insights into anti-Alzheimer’s drugs through predictive modeling using linear regression and QSPR analysis. Modern Phys Lett B. 2024. https://doi.org/10.1142/S0217984924502609.
Zaman S, et al. Mathematical modeling and topological graph description of dominating David derived networks based on edge partitions. Sci Rep. 2023;13(1):15159.
Zaman S, et al. Mathematical analysis and molecular descriptors of two novel metal–organic models with chemical applications. Sci Rep. 2023;13(1):5314.
Aqib M, et al. On topological indices of some chemical graphs. Mol Phys. 2023. https://doi.org/10.1080/00268976.2023.2276386.
Bhatia KS, Gupta AK, Saxena AK. Physicochemical significance of topological indices: importance in drug discovery research. Curr Top Med Chem. 2023;23(29):2735–42.
Zanni R, et al. What place does molecular topology have in today’s drug discovery? Expert Opin Drug Discov. 2020;15(10):1133–44.
Ullah A, Bano Z, Zaman S. Computational aspects of two important biochemical networks with respect to some novel molecular descriptors. J Biomol Struct Dyn. 2024;42(2):791–805.
Ullah A, et al. Predictive potential of K-Banhatti and Zagreb type molecular descriptors in structure–property relationship analysis of some novel drug molecules. J Chin Chem Soc. 2024;71(3):250–76.
Zaman S, et al. Three-dimensional structural modelling and characterization of sodalite material network concerning the irregularity topological indices. J Math. 2023;2023(1):5441426.
Zhang X, et al. The study of curve fitting models to analyze some degree-based topological indices of certain anti-cancer treatment. Chem Pap. 2024;78(2):1055–68.
Meharban S, et al. Molecular structural modeling and physical characteristics of anti-breast cancer drugs via some novel topological descriptors and regression models. Curr Res Struct Biol. 2024;7:100134.
Patel HM, et al. Quantitative structure–activity relationship (QSAR) studies as strategic approach in drug discovery. Med Chem Res. 2014;23:4991–5007.
Zaman S, et al. QSPR analysis of some novel drugs used in blood cancer treatment via degree based topological indices and regression models. Polycycl Aromat Compd. 2023;44:1–17.
Hakeem A, et al. QSPR analysis of some novel drugs used for cardiovascular diseases through degree-based topological indices and regression models. 2023.
Gutman I, Polansky OE. Mathematical concepts in organic chemistry. Berlin: Springer Science & Business Media; 2012.
Fajtlowicz S. On conjectures of Graffiti-II. Congr Numer. 1987;60:187–97.
Furtula B, Gutman I. A forgotten topological index. J Math Chem. 2015;53(4):1184–90.
Zhao W, et al. Computing SS index of certain dendrimers. J Math. 2021;2021:1–14.
Ashraful Alam M, et al. Degree-based entropy for a non-Kekulean benzenoid graph. J Math. 2022;2022:1–12.
Gutman I, Furtula B, Katanić V. Randić index and information. AKCE Int J Graphs Comb. 2018;15(3):307–12.
Farahani MR. On the Randic and sum-connectivity index of nanotubes. Ann West Univ Timisoara-Math Comput Sci. 2013;51(2):39–46.
Shirdel GH, Rezapour H, Sayadi AM. The hyper-Zagreb index of graph operations. Iran J Math Chem. 2013;4(2):213–20.
Ranjini P, Lokesha V, Usha A. Relation between phenylene and hexagonal squeeze using harmonic index. Int J Graph Theory. 2013;1(4):116–21.
Havare ÖÇ. Topological indices and QSPR modeling of some novel drugs used in the cancer treatment. Int J Quantum Chem. 2021;121(24): e26813.
Kirmani SAK, Ali P, Azam F. Topological indices and QSPR/QSAR analysis of some antiviral drugs being investigated for the treatment of COVID-19 patients. Int J Quantum Chem. 2021;121(9):e26594.
Gnanaraj LRM, Ganesan D, Siddiqui MK. Topological indices and QSPR analysis of NSAID drugs. Polycycl Aromat Compd. 2023;43(10):9479–95.
Huang L, et al. Topological indices and QSPR modeling of new antiviral drugs for cancer treatment. Polycycl Aromat Compd. 2023;43(9):8147–70.
Pence HE, Williams A. ChemSpider: an online chemical information resource. Washington, DC: ACS Publications; 2010.
Kim S, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019;47(D1):D1102–9.
Acknowledgements
The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TUDSPP202494).
Funding
This research was funded by Taif University, Saudi Arabia, Project No. (TUDSPP202494).
Author information
Contributions
All authors (Wakeel Ahmed, Shahid Zaman, Eizzah Asif, Kashif Ali, Emad E. Mahmoud and Mamo Abebe Asheboss) contributed equally to this manuscript at all stages, from conceptualization to the write-up of the final draft.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
The manuscript has been approved by all authors and consent for publication has been granted.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ahmed, W., Zaman, S., Asif, E. et al. Exploring the role of topological descriptors to predict physicochemical properties of antiHIV drugs by using supervised machine learning algorithms. BMC Chemistry 18, 167 (2024). https://doi.org/10.1186/s13065024012664
DOI: https://doi.org/10.1186/s13065024012664