Skip to main content

SMILES-based QSAR virtual screening to identify potential therapeutics for COVID-19 by targeting 3CLpro and RdRp viral proteins

Abstract

The COVID-19 pandemic has prompted the medical systems of many countries to develop effective treatments to combat the high rate of infection and death caused by the disease. Within the array of proteins found in SARS-CoV-2, the 3 chymotrypsin-like protease (3CLpro) holds significance as it plays a crucial role in cleaving polyprotein peptides into distinct functional nonstructural proteins. Meanwhile, RNA-dependent RNA polymerase (RdRp) takes center stage as the key enzyme tasked with replicating the viral genomic RNA within host cells. These proteins, 3CLpro and RdRp, are deemed optimal subjects for QSAR modeling due to their pivotal functions in the viral lifecycle. In this study, SMILES-based QSAR classification models were developed for a dataset of 2377 compounds that were defined as either active or inactive against 3CLpro and RdRp. Pharmacophore (PH4) and QSAR modeling were used for the virtual screening on 60.2 million compounds including ZINC, ChEMBL, Molport, and MCULE databases to identify new potent inhibitors against 3CLpro and RdRp. Then, a filter was established based on typical molecular characteristics to identify drug-like molecules. The molecular docking was also performed to evaluate the binding affinity of 156 AND 51 potential inhibitors to 3CLpro and RdRp, respectively. Among the 15 hits identified based on molecular docking scores, M3, N2, and N4 were identified as promising inhibitors due to their good synthetic accessibility scores (3.07, 3.11, and 3.29 out of 10 for M3, N2, and N4 respectively). These compounds contain amine functional groups, which are known for their crucial role in the binding interactions between drugs and their targets. Consequently, these hits have been chosen for further biological assay studies to validate their activity. They may represent novel 3CLpro and RdRp inhibitors possessing drug-like properties suitable for COVID-19 therapy.

Peer Review reports

Introduction

COVID-19, caused by the SARS-CoV-2 virus, has affected over 768 million people worldwide and resulted in more than 6 million deaths [1]. The pandemic has highlighted the urgent need for effective therapeutics to combat emerging viral diseases, given its devastating impact on global health and the economy. While the rapid development of vaccines mitigated the severity of the outbreak, a continued demand persists for antiviral treatments to manage infections, reduce transmission, and address new viral variants. As a result, drug discovery remains a central focus of ongoing research efforts aimed at combating both current and future viral outbreaks.

One promising approach in drug discovery is targeting specific viral proteins essential for viral replication and conserved across different strains of coronaviruses [2]. Notable targets include the 3-chymotrypsin-like protease (3CLpro) and RNA-dependent RNA polymerase (RdRp), both of which play crucial roles in the replication cycle of SARS-CoV-2 [3, 4]. The 3-chymotrypsin-like protease (3CLpro) is a crucial enzyme in the SARS-CoV-2 life cycle, responsible for cleaving the viral polyprotein into functional components essential for replication. Its unique catalytic Cys-His dyad and conserved active site, capable of accommodating multiple substrates, make it an ideal target for antiviral drug development [5]. Similarly, the RNA-dependent RNA polymerase (RdRp), another key viral enzyme, facilitates viral RNA synthesis and is highly conserved across coronaviruses, making it a pivotal target for therapeutics [6]. Together, 3CLpro and RdRp play essential roles in viral replication, positioning them as prime targets for developing COVID-19 treatments. Targeting these proteins has shown promise in identifying therapeutic candidates, but conventional drug discovery methods face significant limitations, including high costs, time inefficiency, and low success rates [7, 8]. Consequently, more rapid and cost-effective approaches are needed to identify potential antiviral compounds while ensuring their efficacy and safety [2].

Several anti-RNA polymerase drugs currently available, such as Ribavirin [9], Galidesivir [10], Remdesivir [11], and Tenofovir [12], have been approved for treating various viral infections. These drugs are now being evaluated for their effectiveness against SARS-CoV-2 RNA-dependent RNA polymerase (RdRp). Regarding the 3CLpro target, numerous studies and ongoing clinical trials have highlighted drugs like Lopinavir [13], Darunavir [14], Ritonavir [15], Ganovo [16], and Cobicistat [17]. Among these, the combination of Ritonavir/Lopinavir (LPV) is frequently tested in clinical trials for COVID-19 treatment. While there is some evidence suggesting LPV’s potential efficacy, its significant side effects are a major concern [18, 19]. Moreover, these findings underscore the importance of RdRp and 3CLpro as crucial targets for drug development against SARS-CoV-2, with inhibiting their activity emerging as a promising therapeutic strategy.

To address these challenges, quantitative structure–activity relationship (QSAR) machine learning models have emerged as a promising tool for accelerating drug discovery. QSAR models can predict the activity of compounds against specific targets based on their molecular properties, enabling the rapid screening of large compound libraries. These models have been successfully applied in identifying potential therapeutics for various diseases, including cancer and Alzheimer’s [20,21,22].

In QSAR classification, machine learning algorithms classify compounds based on their chemical structure and predicted activity. The goal is to build a model that can accurately predict a compound’s activity against a target by analyzing its structural features, such as molecular weight and hydrophobicity. The model is trained on a dataset of compounds with known activity, which is divided into training and test sets. The performance of the model is evaluated using metrics like sensitivity, specificity, and accuracy, ultimately leading to a binary classification (active or inactive) for each compound in the test set [23, 24].

A particularly advantageous approach is using CORAL (Consensus Modeling for Assessing Chemical Toxicity) and SMILES-based QSAR models. These models can handle large and diverse chemical datasets efficiently, thanks to their use of the Simplified Molecular Input Line Entry System (SMILES) notation. This molecular representation captures key structural features while reducing computational costs, making these models well-suited for virtual screening and prioritizing compounds for experimental testing. The CORAL model, which integrates multiple QSAR models and descriptors to generate consensus predictions, further enhances accuracy and robustness [25,26,27].

In this study, we leverage a dataset compiled by Ivanov et al. [28], which includes compounds tested for activity against the viral targets 3CLpro and RdRp. By applying QSAR machine learning models, we aim to identify additional compounds with the potential to serve as therapeutics for COVID-19 and related viral infections. An overview of the steps of the present study is shown in Fig. 1.

Fig. 1
figure 1

Process of molecular modeling in the present study

Materials and methods

Data collection

2377 molecules were selected from an article published by Julian Ivanov and colleagues in 2020 to investigate their inhibitory potency on the COVID-19 virus proteases [28]. The Simplified Molecular Input Line Entry System (SMILES) strings and IC50 values used for CORAL input were directly retrieved from the supplementary files of this article. Initially, the CORAL software checks for duplicated chemicals, incorrect SMILES, and inconsistencies in activity data notation based on SMILES. The SMILES format of the compounds was presented in Table S1 and S2. These molecules include 1168 molecules for 3CLpro, consisting of 468 active and 700 inactive molecules, and 1209 molecules for RdRp, consisting of 464 active and 745 inactive molecules. Compounds with an IC50 of ≤ 10µM were classified as active compounds, and compounds with an IC50 of ≥ 10µM were classified as inactive compounds. In this study, “semi-correlation” were constructed for ten distinct divisions, as opposed to traditional correlations (Fig. 2)[29, 30]. The splits’ identities were also calculated and are displayed in Table 1. It’s important to mention that there are no pairs of splits with an identity exceeding 35%.

Fig. 2
figure 2

Visualization of the general concepts of the traditional correlation and semi-correlation

Table 1 Percentages of identity for random splits

CORAL method

The chemical elements in the molecular structure were encoded as symbols for cycles and branching using SMILES attributes, and CORAL software was utilized to build models based on this representation [31]. The CORAL software, which can be downloaded for free from http://www.insilico.eu/coral, is a computational tool that utilizes Monte Carlo methods to develop regression and classification models based on SMILES descriptors and their corresponding endpoints (i.e. active or inactive). To build the models, the compounds were randomly divided into four datasets: a training set (30%), an invisible training set (30%), a calibration set (20%), and a validation set (20%). To create the QSAR model, data from the training set was utilized [30]. The training set is the foundation for constructing the model or “builder”. Monte Carlo optimization is being employed to adjust correlation weights for molecular features derived from SMILES associated with this particular set. The invisible training set is the “inspector” of the model. Computing descriptors for SMILES within this set should either confirm or reject the model's appropriateness for substances not directly engaged in the optimization procedure. The calibration set should identify the onset of overfitting. Based on computational experiments, it's evident that optimization enhances the correlation between descriptors and an endpoint for both the training and invisible training sets. However, as the optimization progresses through more epochs, there's a gradual decrease in the correlation coefficient between descriptors and the endpoint for the calibration set [32]. The validation set was used to test the predictability of the QSAR model. The CORAL software utilized identical algorithms to compute the classification models. The regression models were established on genuine correlations, while the classification models relied on pseudo correlations. The dataset for the classification models was allocated either a value of 1, indicating “active,” or a value of 0, indicating “inactive.”

Optimal descriptor

Molecular descriptors, which are mathematical entities, encode the chemical and physical properties of molecules. Choosing an appropriate set of descriptors is crucial as it determines the accuracy of predictive models. 2D representations based on molecular graphs contain around 70–80% of the information in QSAR models. Fragment-based QSAR calculates descriptors for molecular fragments generated by predefined rules. 2D QSAR is simpler and less time-consuming than other methods. Useful fragments can be identified and mapped onto molecules to suggest improvements.

CORAL software offers three distinct descriptor of correlation weights (DCW) types: graph-based, SMILES-based, and hybrid [33]. The graph-based and SMILES-based DCWs are calculated solely based on their respective input types, while the hybrid DCW is generated using both graph and SMILES inputs. To compute the optimal descriptors using SMILES, the following formula is utilized:

$${}^{SMILES}DCW= \sum_{k=1}^{N}CW\left({S}_{k}\right)+\sum_{k=1}^{N-1}CW\left({SS}_{k}\right)+ \sum_{k=1}^{N-2}CW\left({SSS}_{k}\right)+ \sum_{k}CW\left(NOS{P}_{k}\right)+\sum_{k}CW\left(HAL{O}_{k}\right)+ \sum_{k}CW\left(BON{D}_{k}\right)+\sum_{k}CW\left(HAR{D}_{k}\right)$$
(1)

Table 2 contains a detailed description of the SMILES attributes invariants used in Eq. (1).

Table 2 The detailed description of SMILES attributes

In the context of SMILES fragments, the threshold is a parameter used to distinguish between rare and non-rare fragments. If the number of SMILES containing a particular fragment in the training set is less than the threshold, it is considered rare; otherwise, it is classified as non-rare. During the Monte Carlo optimization process, one epoch refers to a complete cycle of modifying all correlation weights. The correlation weights, which determine the maximum value of the correlation coefficient between the endpoint (0,1) and DCW (Threshold, Nepoch), are obtained numerically using the Monte Carlo method.

The target function of the optimization is to find the correlation coefficient between the optimal descriptor and an endpoint. This is achieved through the CORAL model, which establishes a linear relationship between a predicted endpoint Y and a descriptor of correlation weights (DCW). The DCW is represented by the following mathematical equation:

$$Y={C}_{0}+{C}_{1}{\times} DCW\left({T}^{*},{N}^{*}\right)$$
(2)

Using the least squares method, we can obtain two regression coefficients, C0 and C1, which are used to create an optimal DCW model based on the dataset. During the Monte Carlo optimization process, the parameters T* and N* are used, where T* represents the threshold value and N* represents the number of epochs. In DCW modeling, T refers to the threshold number of active attributes. For example, if T is set to four, any attributes that appear in less than four molecules are considered inactive, and their correlation weight will be zero [34,35,36].

Statistical criteria

For the binary classification model where “active” is represented by 1 and “inactive” by 0, several statistical measures are used to evaluate the model's performance. These measures include the Matthews correlation coefficient (MCC), sensitivity, specificity, and accuracy, which are calculated as follows:

$$Sensitivity=\frac{TP}{TP+FN}$$
(3)
$$Specificity=\frac{TN}{TN+FP}$$
(4)
$$Accuracy=\frac{TP+TN}{TP+FP+FN+TN}$$
(5)
$$MCC=\frac{TP{\times} TN-FP{\times} FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
(6)

A confusion matrix is a table used to assess the performance of a classification model. It includes the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) predictions made by the model [30].

The performance measures used in evaluating a classification model are important indicators of its quality. One such measure is the Matthews correlation coefficient (MCC), which is commonly used for binary classification problems [37]. The MCC is similar to the traditional correlation coefficient but is designed specifically for this case. The MCC coefficient, applied in machine learning as a balanced measure of the quality of binary classifications, proving beneficial even when the classes vary significantly in size. In practical terms, a model is considered good if the MCC 1, or generally, the MCC should be greater than 0.6 for it to be effective. Another important measure is sensitivity, which assesses the model’s ability to correctly identify positive observations, such as active compounds. Specificity, on the other hand, evaluates the model’s ability to accurately identify negative observations, such as inactive compounds. Finally, accuracy is a measure of the overall performance of the model, taking into account its ability to predict both positive and negative observations. These measures are crucial in determining the effectiveness of a classification model and ensuring its ability to make accurate predictions.

Applicability domain

The OECD QSAR validation principles require QSAR models to be used within their applicability domain (AD), which refers to the space or knowledge used to develop the model and make predictions for new compounds. Regarding CORAL models, the domain of applicability is determined by analyzing the statistical defects of SMILES, which are calculated based on the distribution of available data in the training, invisible training, calibration, and validation sets [38]. To define the domain of applicability, the defect of the SMILES attribute is measured as the difference between the probability of the attribute in the training set and its probability in the calibration set. The total SMILES-defect is obtained by summing up the defects of all attributes. If the SMILES-defect of a particular SMILES is less than double the average defect of compounds in the training set, it is considered to be within the domain of applicability; otherwise, it falls outside the domain of applicability.

$${Defect}_{{A}_{k}}= \frac{\left|{P}_{TRN}(\right.{A}_{k})-\left.{P}_{CAL}\left({A}_{k}\right)\right|}{{N}_{TRN}\left({A}_{k}\right)+ {N}_{CAL}\left({A}_{k}\right)}, if \,{N}_{TRN}\left({A}_{k}\right)>0 \,{Defect}_{{A}_{k}}=1, if \,{N}_{TRN}\left({A}_{k}\right)=0$$
(7)

The probabilities of attribute A in the training and calibration sets are denoted by \({P}_{TRN}\left({A}_{k}\right)\) and \({P}_{CAL}\left({A}_{k}\right)\), respectively. On the other hand, the frequencies of A in the training and calibration sets are represented by \({N}_{TRN}\left({A}_{k}\right)\) and \({N}_{CAL}\left({A}_{k}\right)\), respectively.

In the SMILES notation, the statistical defect (D) is the total sum of all statistical defects of all attributes.

$${Defect}_{Molecule}=\sum_{k=1}^{NA}{Defect}_{{A}_{k} }$$
(8)

In CORAL, the number of active SMILES attributes in a compound is denoted by NA. If a compound is outside the domain of applicability, it is considered an outlier. The detection of outliers in CORAL is based on the condition expressed in inequality 9.

$${Defect}_{Molecule} >2{\times} \overline{{Defect }_{TRN}}$$
(9)

The mean value of statistical defects calculated for the training dataset is referred to as \({Defect}_{TRN} D.\)

Virtual screening workflow

Virtual screening is a computational technique used in drug discovery and development to identify potential drug candidates through in-silico (computer-based) screening of large libraries of small molecules. The process involves the use of computer algorithms and molecular modeling techniques to predict the potential binding affinity of small molecules with target proteins, and thus, identify molecules with a high likelihood of being effective drug candidates [39]. The Pharmit webserver was used to conduct in silico virtual screening on four databases (ChEMBL, ZINC, MCULE, and MolPort) in building pharmacophore (PH4) model and the identification of drug-like molecules stages [40]. Pharmit is an online interactive server that allows for the virtual screening of various compound databases using pharmacophore models and molecular shapes. The reference article identified several molecules with in vitro inhibitory effects on 3CLpro, among which 1H-Indole-2-carboxylic acid, 5-fluoro-, 1H-benzotriazol-1-yl ester (A1) with an IC50 of 0.013 was considered the most active compound. Similarly, 3-[Isopropyl(trans-4-methylcyclohexylcarbonyl)amino]-5 phenylthiophene -2 carboxylic acid (A2) was considered the most active compound with in vitro inhibitory effect on RdRp, with an IC50 of 0.009 [28]. To perform the structure-based pharmacophore screening, we utilized these two most active compounds A1 and A2 in the Pharmit server. The QSAR models were selected to conduct virtual screening. Pharmit filters compounds during hit screening based on drug-like properties including number of rotatable bonds, molecular weight, logP, topological polar surface area, number of HBAs, number of HBDs, and number of aromatic groups. The specified ranges for these properties are: MW ≤ 500, rotatable bonds ≤ 10, logP ≤ 5, PSA ≤ 140 Å2, aromatic groups ≤ 5, 2 ≤ HBA ≤ 7, and 2 ≤ HBD ≤ 7. The binding modes of inhibitors, along with the critical molecular interactions inside 3CLpro and RdRp active sites with PDB ID 6Y2F and 6NUR, were investigated using the Smina [41] molecular docking package. A purification process was performed by removing all heteroatoms and solvent molecules from the structure. Polar hydrogens were then added to the PDB files. All ligands for docking were sketched using Discovery Studio 2020, and assigned gasteiger charges and energy optimization of ligands using the steepest descent algorithm carried out by Open Babel [42]. The details of the molecular docking algorithm in Smina have been explained in previous studies [36, 43]. Visualization and interaction analyses were performed using the Discovery Studio 2020 viewer.

Results and discussion

QSAR models

For the 3CLpro enzyme, a collection of 1168 molecules were used, with 468 molecules known to be active against the enzyme and 700 molecules known to be inactive. These molecules were split into four different groups: a training set of 339 molecules, an invisible training set of 368 molecules, a calibration set of 230 molecules, and a validation set of 231 molecules. Similarly, for the RdRp enzyme, a group of 1209 molecules were used, with 464 known to be active and 745 known to be inactive. These molecules were also divided into four groups: a training set of 358 molecules, an invisible training set of 386 molecules, a calibration set of 225 molecules, and a validation set of 240 molecules. To begin with, the process of parameter optimization was carried out in order to determine the most suitable values for the threshold and number of epochs (Nepoch). This was done to ensure that the model would produce accurate predictions while minimizing the risk of overfitting.

To analyze the data set for proteases 3CLpro and RdRp, it was divided into ten separate parts referred to as “splits” (split 1, split 2, and so on up to split 10). The analysis was conducted using a Nepoch value of 30 and threshold values ranging from 1 to 3. For all ten splits of both proteases, the Monte Carlo optimization used a preferred threshold value (T*) of 1 and a preferred number of epochs (N*) of 3. The models for each of the ten splits (splits 1 through 10) were obtained using this methodology. Tables 3 and 4 displays the statistical properties of the binary classifications for 3CLpro and RdRp, with the activity being determined using the following formula:

Table 3 the statistical properties of the inhibitor activity classification models of 3CLPro that were generated through the Monte Carlo method optimization for ten random splits
Table 4 The statistical properties of the inhibitor activity classification models of RdRp that were generated through the Monte Carlo method optimization for ten random splits
$$Class=\left\{\begin{array}{c}Inhibitory \,activity \le 10\mu M, class=1(active)\\ Inhibitory \,activity >10\mu M, class=0 (inactive)\end{array}\right.$$
(10)

The majority of the models in the study achieved an accuracy, sensitivity, and specificity greater than 90% for the training, invisible training, calibration, and validation sets of 3CLpro and RdRp across splits 1–10, indicating their ability to predict the activity of the viral enzymes.

It’s important to highlight that the classification model relies on unique semi-correlations, making it unsuitable for using of some traditional criteria. For traditional correlation, an r2 value around 0.4 suggests a weak regression model. But when it comes to semi-correlation with a similar r2 value (0.4), having specificity, sensitivity, accuracy, and MCC at that level could be seen as good or even outstanding. This means most items in the chart are accurately classified, usually with just one false positive and one false negative. The CORAL software operates under the fundamental assumption that a well-performing model on a calibration set should also perform well on an external validation set. In this case, the statistical parameters of the models for the two proteases are deemed good and acceptable. Furthermore, for split #7 (bold in the Table 3) of 3CLPro and split #4 (bold in Table 4) of RdRp, the MCC values for the validation sets are among the top-performing models, with 0.9478 and 0.8890, respectively.

By running Monte Carlo optimization multiple times, one can obtain three groups of SMILES attributes: (i) attributes or promoters that only have positive correlation weights, which promote the activity of compounds; (ii) attributes or promoters that only have negative correlation weights, which promote the inactivity of compounds; and (iii) attributes or promoters that have both positive and negative correlation weights in multiple Monte Carlo runs, and their role is not yet understood. This method allows for a mechanistic interpretation of the model. Tables 5 and 6 provides a list of potential promoters that be linked to both increased and decreased activities for both proteases. The findings from Table 5 revealed that certain structural features, such as aliphatic oxygen and double bond, nitrogen and double bond, aliphatic oxygen and double bond with branching, two successive aliphatic carbon, aliphatic carbon and double bond with branching, double bond, presence of at least three rings, carbon and double bond with branching, aliphatic carbon and aliphatic nitrogen with branching, aliphatic carbon and aliphatic oxygen with branching, molecule containing nitrogen and oxygen with at least one ring and branching, were identified as significant factors for increased inhibitory activity against 3CLPro. On the other hand, decreased activity against 3CLPro was associated with certain structural features, such as nitrogen and oxygen with double bond, oxygen and double bond, three successive aliphatic carbon, aliphatic carbon and double bond with at least three rings, hydrogen and stereo specific bonds.

Table 5 A collection of SMILES attributes that can be interpreted as promoters of activity or inactivity of compounds against 3CLPro
Table 6 A collection of SMILES attributes that can be interpreted as promoters of activity or inactivity of compounds against RdRp

Table 6 presented our findings on the influential features that affect inhibitory activity against RdRp. We observed that negative charge, nitrogen and double bond, aliphatic carbon with branching, double bond, nitrogen and oxygen, double bond, aliphatic carbon and aliphatic nitrogen, aliphatic carbon and branching, presence of at least two rings, the highest number of sulfur equal to zero, oxygen and double bond, and aromatic carbon in the first ring had a positive impact on inhibitory activity. On the other hand, aliphatic carbon with branching and aliphatic oxygen, aliphatic oxygen and two branching, aliphatic carbon and branching, aliphatic nitrogen with branching and aliphatic carbon, branching with aliphatic carbon and stereo specific bonds, successive two aliphatic carbon and aliphatic nitrogen, aliphatic carbon and double bond, aromatic nitrogen in the second ring, the highest number of oxygens equal to one, two successive aliphatic carbons and ring, carbon and double bond with branching, and presence of three successive aliphatic carbon were found to decrease inhibitory activity against RdRp.

Virtual screening analysis

The Pharmit server was utilized to propose pharmacophore characteristics by emphasizing on the key residues involved in active site interactions of 3CLPro and RdRp with compound A1 and A2 as the most active compounds, respectively (Fig. 3). Among the databases supported by Pharmit, we opted to screen the ZINC, CHEMBL32, MCULE, and MolPort databases due to the availability of purchasable compounds for virtual screening. The QSAR models were selected for predicting the activity (active or inactive) of compounds. Then, the filter was established using typical molecular characteristics for recognizing drug-like molecules. These features comprise molecular weight, log P (a measurement of lipophilicity), topological polar surface area (a sign of the compound's ability to penetrate cell membranes), the number of rotatable bonds, the number of aromatic groups, the number of hydrogen bond acceptors, and the number of hydrogen bond donors. OpenBabel is utilized to precompute these features. Smina was used in the third screening to confirm that the search for the binding site of 3CLPro and RdRp had produced all hits. The compounds were then sorted based on their docking score values, and only those with scores higher than compounds A1 with 3CLPro (− 7.22 kcal/mol) and A2 with RdRp(− 7.84 kcal/mol) were included in the list. Figure 3 shows the PH4 models and interaction patterns of two complexes: one between 3CLPro and A1, and the other between RdRp and A2.

Fig. 3
figure 3

The interaction patterns and superpositions of the PH4 models for 3CLPro and RdRp with PDB ID (A) 6Y2F with A1 and (B) 6NUR with A2. PH4 features are represented using spheres colored green for hydrophobic, orange for hydrogen bond acceptor, white for hydrogen bond donor, and red for negative ion features

The results of molecular docking indicate that the A1 molecule forms hydrogen bonds with three different amino acids, namely ARG188, GLN192, and THR190. Additionally, this molecule exhibits hydrophobic interactions with amino acids PRO168, MET165, and HIS41. The amino acids GLN189 and VAL186 are also found to be involved in van der Waals interactions with the A1 molecule. The findings of molecular docking analysis between RdRp protease with molecule A2 indicated that GLN724 amino acid forms a hydrogen bond with the carbonyl group's oxygen. Additionally, LEU708 and ARG721 were observed to engage in hydrophobic interactions with the rings and branches, while THR710, GLY712, HIS725, and TYR732 were found to participate in van der Waals interactions. The results are consistent with the pharmacophoric models.

Figure 4 presents the hits obtained from virtual screening procedures across all four databases for each protein. During the final stage of screening, a certain number of compounds were subjected to docking simulations using different proteins, with 156 compounds docked into 3CLPro and 51 compounds into RdRp. The structures and molecular docking minimized affinity values of these hits were shown in Table 7.

Fig. 4
figure 4

Result diagram screening of inhibitors for (A) 3CLPro and (B) RdRp from the databases

Table 7 Hits retrieved from the virtual screening alongside their minimized affinity values

Specifically, the hit structures M3 for 3CLpro, attributed to the presence of rings and oxygen with a double bond, were found to enhance the compound's activity in the QSAR model. Additionally, N2 and N4 for RdRp, characterized by the presence of activity-promoting factors such as nitrogen and a ring in the chain, and the absence of sulfur, are highly compatible with the QSAR models. As a result, they were studied in greater detail. Upon analyzing the interaction between M3 and 3CLpro, it was observed that there were significant hydrogen bonds formed between the oxygen and nitrogen of M3 with THR25, LEU141, GLY143, SER144, and CYS145. Additionally, two hydrophobic interactions occurred between the two rings of M3 and MET49 and CYS145 (Fig. 5A). According to Fig. 5B, the N2 molecule bound to RdRp forms three hydrogen bonds from N and O atoms with ASN713, GLN724, and TYR732 amino acid residues. Furthermore, the three rings of the molecule engage in hydrophobic interactions with specific amino acid residues such as VAL128, HIS133, LEU207, LEU240, GLY712, and LEU708. The docking results of N5 with RdRp show that there is a hydrogen bond between N5 and TYR732. Hydrophobic interactions were observed between N5 and VAL128, TYR129, HIS133, LEU240, LEU708, and TYR728. Furthermore, ARG132 and ASP465 were involved in attractive charge and salt bridges with the charged part of the molecule (Fig. 5C). The M3, N2, and N4 compounds contain at least an amine functional group, which is known to play a crucial role in drug-target binding interactions. Amine groups can pass through cell membranes when in their non-ionized form, and when ionized, they exhibit favorable solubility in water. Furthermore, the fact that the three compounds are highly soluble in water makes them ideal for various drug production activities. It is worth noting that solubility is a critical feature that affects absorption, especially for discovery projects that aim for oral administration. Additionally, a drug intended for injection must have excellent water solubility to ensure that a small medicinal dose contains enough of the active substance. The three compounds possess a significant number of nitrogens with non-bonding electrons, which suggests that they exhibit hydrophilic properties. As a result, they are logical choices for inhibiting 3CLpro and RdRp and can be considered as potential candidates.

Fig. 5
figure 5

The interaction patterns models for (A) 6Y2F with M3, (B) 6NUR with N2 and (C) 6NUR with N5

ADMET properties

ADMET stands for Absorption, Distribution, Metabolism, Excretion, and Toxicity. These are critical factors that affect the pharmacokinetics (how drugs move through the body) and pharmacodynamics (how drugs interact with the body) of a drug. The SwissADME server [44] and the DataWarrior [45] software were used to compute various measures, including the bioavailability score, gastrointestinal absorption, logKp for skin permeation, water solubility and toxicity. The Abbott bioavailability score is used to assess drug-likeness, with a score of 0.55 indicating that the best-predicted hits from virtual screening passed the rule-of-five. Skin permeability, logKp was also calculated and fell within the standard range of − 1 to − 8 for 95% of drugs [46]. The range of water solubility typically reported in drug discovery and development is − 6.5 < logS < 0.5. There are various factors that regulate bioavailability, but ultimately, gastrointestinal absorption is the key determinant [47]. Among the 15 hits identified based on molecular docking scores, M3, N2, and N4 were identified as promising inhibitors due to their good synthetic accessibility scores (3.07, 3.11, and 3.29 out of 10 for M3, N2, and N4 respectively). The reported hits showed high gastrointestinal absorption, and toxicity risk was assessed for mutagenic, tumorigenic, irritant, and reproductive effects. Reproductive toxicity may cause alterations to the male and female reproductive systems, while irritant toxicity can cause reversible damage to the skin or other organs. A majority of the virtual screening hits exhibited satisfactory molecular properties, as shown in Table 8.

Table 8 The projected ADMET characteristics for the identified hits

Conclusion

The COVID-19 pandemic, caused by the novel coronavirus SARS-CoV-2, poses a significant threat to global health. To support the development of effective treatments, we combined ligand-based and structure-based drug design approaches to identify potent inhibitors against SARS-CoV-2. SMILES-based classification models were used to create predictive two-dimensional QSAR models. Our comprehensive virtual screening workflow included PH4 analysis, QSAR modeling, evaluation of drug-like properties, molecular docking, and ADMET testing. The ease of using CORAL software to generate QSAR models proved advantageous for rapidly screening potential compounds, as it reduces the chemical space to a manageable size for further analysis and development. This approach identified 15 potential inhibitors, with M3, N2, and N4 emerging as the most promising candidates due to their favorable synthetic accessibility scores (3.07, 3.11, and 3.29, respectively) and the presence of amine functional groups, which are crucial for effective binding interactions with drug targets. These compounds have been selected for further biological assays to validate their efficacy. This study provides valuable insights into the development of novel inhibitors with potential therapeutic applications for COVID-19. To confirm our computational findings, experimental evaluation of the identified compounds against RdRp and 3CLpro activity using standard enzyme-based or cell-based assays is recommended.

Availability of data and materials

Data is provided within the manuscript and supplementary information files.

References

  1. Li J, Wang J, Wang H. Emerging landscape of preclinical models for studying COVID-19 neurologic diseases. ACS Pharmacol Transl Sci. 2023;6(10):1323–39.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Farha MA, Brown ED. Drug repurposing for antimicrobial discovery. Nat Microbiol. 2019;4(4):565–77.

    Article  CAS  PubMed  Google Scholar 

  3. Dai W, Jochmans D, Xie H, Yang H, Li J, Su H, Chang D, Wang J, Peng J, Zhu L. Design, synthesis, and biological evaluation of peptidomimetic aldehydes as broad-spectrum inhibitors against enterovirus and SARS-CoV-2. J Med Chem. 2021;65(4):2794–808.

    Article  PubMed  Google Scholar 

  4. Wu C, Liu Y, Yang Y, Zhang P, Zhong W, Wang Y, Wang Q, Xu Y, Li M, Li X. Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods. Acta Pharmaceutica Sinica B. 2020;10(5):766–88.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Kneller DW, Phillips G, O’Neill HM, Jedrzejczak R, Stols L, Langan P, Joachimiak A, Coates L, Kovalevsky A. Structural plasticity of SARS-CoV-2 3CL Mpro active site cavity revealed by room temperature X-ray crystallography. Nat Commun. 2020;11(1):3202.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Gao Y, Yan L, Huang Y, Liu F, Zhao Y, Cao L, Wang T, Sun Q, Ming Z, Zhang L. Structure of the RNA-dependent RNA polymerase from COVID-19 virus. Science. 2020;368(6492):779–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Xue H, Li J, Xie H, Wang Y. Review of drug repositioning approaches and resources. Int J Biol Sci. 2018;14(10):1232.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Leelananda SP, Lindert S. Computational methods in drug discovery. Beilstein J Org Chem. 2016;12(1):2694–718.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Hung I. Lopinavir/ritonavir, ribavirin and IFN-beta combination for nCoV treatment. NCT04276688. 2020.

  10. USNLo M. A Study to Evaluate the Safety, Pharmacokinetics and Antiviral Effects of Galidesivir in Yellow Fever or COVID-19. ClinicalTrialsgov. 2020.

  11. Wang Y, Zhang D, Du G, Du R, Zhao J, Jin Y, Fu S, Gao L, Cheng Z, Lu Q. Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial. Lancet. 2020;395(10236):1569–78.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Polo R, Hernán M. Randomized clinical trial for the prevention of SARS-CoV-2 infection (COVID-19) in healthcare personnel (EPICOS). ClinicalTrials. gov; 2020.

  13. Wang Z, Xu X. scRNA-seq profiling of human testes reveals the presence of the ACE2 receptor, a target for SARS-CoV-2 infection in spermatogonia, Leydig and Sertoli cells. Cells. 2020;9(4):920.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. De Meyer S, Bojkova D, Cinatl J, Van Damme E, Buyck C, Van Loock M, Woodfall B, Ciesek S. Lack of antiviral activity of darunavir against SARS-CoV-2. Int J Infect Dis. 2020;97:7–10.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Ye X-T, Luo Y-L, Xia S-C, Sun Q-F, Ding J-G, Zhou Y, Chen W, Wang X-F, Zhang W-W, Du W-J. Clinical efficacy of lopinavir/ritonavir in the treatment of Coronavirus disease 2019. Eur Rev Med Pharmacol Sci. 2020;24(6):3390–6.

    PubMed  Google Scholar 

  16. Chen YW, Yiu CPB, Wong K-Y. Prediction of the SARS-CoV-2 (2019-nCoV) 3C-like protease (3CL pro) structure: virtual screening reveals velpatasvir, ledipasvir, and other drug repurposing candidates. F1000Research. 2020;9:129.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Nabirotchkin S, Peluffo AE, Bouaziz J, Cohen D. Focusing on the unfolded protein response and autophagy related pathways to reposition common approved drugs against COVID-19. Basel: MDPI AG; 2020.

    Book  Google Scholar 

  18. Verdugo-Paiva F, Izcovich A, Ragusa M, Rada G. C.-L.-OW Group, Lopinavir/ritonavir for the treatment of COVID-19: a living systematic review protocol. medRxiv. 2020;9:399.

    Google Scholar 

  19. Tobaiqy M, Qashqary M, Al-Dahery S, Mujallad A, Hershan AA, Kamal MA, Helmi N. Therapeutic management of patients with COVID-19: a systematic review. Infect Prevent Pract. 2020;2(3):100061.

    Article  CAS  Google Scholar 

  20. Vatansever S, Schlessinger A, Wacker D, Kaniskan HÜ, Jin J, Zhou MM, Zhang B. Artificial intelligence and machine learning-aided drug discovery in central nervous system diseases: state-of-the-arts and future directions. Med Res Rev. 2021;41(3):1427–73.

    Article  PubMed  Google Scholar 

  21. Prachayasittikul V, Worachartcheewan A, Toropova A, Toropov A, Schaduangrat N, Prachayasittikul V, Nantasenamat C. Large-scale classification of P-glycoprotein inhibitors using SMILES-based descriptors. SAR QSAR Environ Res. 2017;28(1):1–16.

    Article  CAS  PubMed  Google Scholar 

  22. Tsou LK, Yeh S-H, Ueng S-H, Chang C-P, Song J-S, Wu M-H, Chang H-F, Chen S-R, Shih C, Chen C-T. Comparative study between deep learning and QSAR classifications for TNBC inhibitors and novel GPCR agonist discovery. Sci Rep. 2020;10(1):16771.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Lo Y-C, Rensi SE, Torng W, Altman RB. Machine learning in chemoinformatics and drug discovery. Drug Discov Today. 2018;23(8):1538–46.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Shiri F, Pirhadi S, Rahmani A. Identification of new potential HIV-1 reverse transcriptase inhibitors by QSAR modeling and structure-based virtual screening. J Recept Signal Transduct. 2018;38(1):37–47.

    Article  CAS  Google Scholar 

  25. Cappelli CI, Toropov AA, Toropova AP, Benfenati E. Ecosystem ecology: Models for acute toxicity of pesticides towards Daphnia magna. Environ Toxicol Pharmacol. 2020;80:103459.

    Article  CAS  PubMed  Google Scholar 

  26. Lotfi S, Ahmadi S, Zohrabi P. QSAR modeling of toxicities of ionic liquids toward Staphylococcus aureus using SMILES and graph invariants. Struct Chem. 2020;31:2257–70.

    Article  CAS  Google Scholar 

  27. Manganelli S, Benfenati E, Manganaro A, Kulkarni S, Barton-Maclaren TS, Honma M. New quantitative structure–activity relationship models improve predictability of Ames mutagenicity for aromatic azo compounds. Toxicol Sci. 2016;153(2):316–26.

    Article  CAS  PubMed  Google Scholar 

  28. Ivanov J, Polshakov D, Kato-Weinstein J, Zhou Q, Li Y, Granet R, Garner L, Deng Y, Liu C, Albaiu D. Quantitative structure–activity relationship machine learning models and their applications for identifying viral 3CLpro-and RdRp-targeting compounds as potential therapeutics for COVID-19 and related viral infections. ACS Omega. 2020;5(42):27344–58.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Toropov A, Toropova A, Benfenati E, Gini G, Leszczynska D, Leszczynski J. CORAL: classification model for predictions of anti-sarcoma activity. Curr Top Med Chem. 2012;12(24):2741–4.

    Article  CAS  PubMed  Google Scholar 

  30. Toropov AA, Toropova AP, Rasulev BF, Benfenati E, Gini G, Leszczynska D, Leszczynski J. CORAL: binary classifications (active/inactive) for liver-related adverse effects of drugs. Curr Drug Saf. 2012;7(4):257–61.

    Article  CAS  PubMed  Google Scholar 

  31. Toropova AP, Toropov AA. CORAL: binary classifications (active/inactive) for drug-induced liver injury. Toxicol Lett. 2017;268:51–7.

    Article  CAS  PubMed  Google Scholar 

  32. Toropova AP, Toropov AA. QSPR and nano-QSPR: what is the difference? J Mol Struct. 2019;1182:141–9.

    Article  CAS  Google Scholar 

  33. Ahmadi S. Mathematical modeling of cytotoxicity of metal oxide nanoparticles using the index of ideality correlation criteria. Chemosphere. 2020;242:125192.

    Article  CAS  PubMed  Google Scholar 

  34. Lotfi S, Ahmadi S, Kumar P. A hybrid descriptor based QSPR model to predict the thermal decomposition temperature of imidazolium ionic liquids using Monte Carlo approach. J Mol Liq. 2021;338:116465.

    Article  CAS  Google Scholar 

  35. Kumar A, Kumar P. Cytotoxicity of quantum dots: Use of quasiSMILES in development of reliable models with index of ideality of correlation and the consensus modelling. J Hazard Mater. 2021;402:123777.

    Article  CAS  PubMed  Google Scholar 

  36. Soleymani N, Ahmadi S, Shiri F, Almasirad A. QSAR and molecular docking studies of isatin and indole derivatives as SARS 3CLpro inhibitors. BMC Chem. 2023;17(1):32.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Toropova AP, Toropov AA, Benfenati E. Semi-correlations as a tool to model for skin sensitization. Food Chem Toxicol. 2021;157:112580.

    Article  CAS  PubMed  Google Scholar 

  38. Javidfar M, Ahmadi S. QSAR modelling of larvicidal phytocompounds against Aedes aegypti using index of ideality of correlation. SAR QSAR Environ Res. 2020;31(10):717–39.

    Article  CAS  PubMed  Google Scholar 

  39. Ghasemi JB, Shiri F, Pirhadi S, Heidari Z. Discovery of new potential antimalarial compounds using virtual screening of ZINC database. Comb Chem High Throughput Screen. 2015;18(2):227–34.

    Article  CAS  PubMed  Google Scholar 

  40. Sunseri J, Koes DR. Pharmit: interactive exploration of chemical space. Nucleic Acids Res. 2016;44(W1):W442–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Koes DR, Baumgartner MP, Camacho CJ. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model. 2013;53(8):1893–904.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open babel: an open chemical toolbox. J Cheminform. 2011;3:1–14.

    Article  Google Scholar 

  43. Hashemizadeh M, Shiri F, Shahraki S, Razmara Z. A multidisciplinary study for investigating the interaction of an iron complex with bovine liver catalase. Appl Organomet Chem. 2022;36(11): e6881.

    Article  CAS  Google Scholar 

  44. Daina A, Michielin O, Zoete V. SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci Rep. 2017;7(1):42717.

    Article  PubMed  PubMed Central  Google Scholar 

  45. Sander T, Freyss J, von Korff M, Rufener C. DataWarrior: an open-source program for chemistry aware data visualization and analysis. J Chem Inf Model. 2015;55(2):460–73.

    Article  CAS  PubMed  Google Scholar 

  46. Ntie-Kang F. An in silico evaluation of the ADMET profile of the StreptomeDB database. Springerplus. 2013;2:1–11.

    Article  Google Scholar 

  47. Newby D, Freitas AA, Ghafourian T. Decision trees to characterise the roles of permeability and solubility on the prediction of oral absorption. Eur J Med Chem. 2015;90:751–65.

    Article  CAS  PubMed  Google Scholar 

Download references

Funding

This work was funded by the University of Zabol [Grant code: IR-UOZ-GR-0144].

Author information

Authors and Affiliations

Authors

Contributions

Faezeh Bazzi-Allahri: Visualization, Methodology, Formal analysis. Fereshteh Shiri: Writing—review & editing, Writing—original draft, Supervision, Funding acquisition, Conceptualization. Shahin Ahmad: review & editing, Validation, Software. Alla P. Toropova: review & editing, software. Andrey A. Toropov: review & editing, software.

Corresponding author

Correspondence to Fereshteh Shiri.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Bazzi-Allahri, F., Shiri, F., Ahmadi, S. et al. SMILES-based QSAR virtual screening to identify potential therapeutics for COVID-19 by targeting 3CLpro and RdRp viral proteins. BMC Chemistry 18, 191 (2024). https://doi.org/10.1186/s13065-024-01302-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13065-024-01302-3

Keywords