QSAR based on hybrid optimal descriptors as a tool to predict antibacterial activity against Staphylococcus aureus

Background: Staphylococcus aureus bacterial infections are still a serious health care problem. Therefore, the development of new drugs for these infections is a constant requirement. Quantitative structure–activity relationship (QSAR) methods can assist this development. Methods: The study included 151 structurally diverse compounds with antibacterial activity against S. aureus ATCC 25923 (Endpoint 1) or the drug-resistant clinical isolate of S. aureus (Endpoint 2). QSARs based on hybrid optimal descriptors were used. Results: The predictive potential of the developed models has been checked with three random splits into active training, passive training, calibration, and validation sets. The proposed approach gives satisfactory predictive models for both endpoints examined. Conclusions: The results of the study show the possibility of using SMILES-based QSAR in the evaluation of the antibacterial activity of structurally diverse compounds for both endpoints. Although the approach gives satisfactory predictive models for both endpoints examined, splitting has an apparent influence on the statistical quality of the models.

The dramatic increase in multidrug-resistant bacterial infections in recent decades has become a serious health care problem. In particular, multidrug-resistant strains of Gram-positive bacterial pathogens, namely Staphylococcus aureus, which dominate worldwide bacterial infection rates, are of very serious concern [11,12]. Although various antimicrobial drugs are used in treatment, the mortality rate in S. aureus bacteremia remains high, and the development of new drugs or the elaboration of new types of previously known drugs remains a pressing task [13-15].
In our previous work, we have dealt with the synthesis of new antibacterial agents, determined their minimum inhibitory concentrations (MIC) against a number of microorganisms, and evaluated their properties using various QSAR approaches. First, in 2010 a novel series of N-(2-hydroxyphenyl)benzamides and N-(2-hydroxyphenyl)-2-phenylacetamides was synthesized (Fig. 1A) [16]. The microbiological results indicated that they possess a broad spectrum of activity against various pathogens (MIC values between 1.95 and 500 µg/mL). A follow-up study [17] using classical QSAR and 3D common-feature pharmacophore hypothesis approaches showed that the insertion of a methylene group between the phenyl and carboxyamido moiety decreases MIC. In contrast, the substituent at position R1 is important for the increase of activity, and similarly, substituting position R3 with a group enhancing the electron-donor capability of the phenolic ring system increased the potency of a compound. Finally, it was found that the benzamide derivatives exhibited the greatest activity against drug-resistant bacteria, including S. aureus. These findings led to the synthesis of the series of 2-(p-substituted benzyl)-5-(2-substituted acetamido)benzoxazoles (Fig. 1B) [18]. The microbiological assay showed that the compounds possessed a wide range of MIC values, between 7.8 and 250 µg/mL. The 2D-QSAR showed that the width and hydrophobicity of the R1 substituent are directly proportional to MIC against methicillin-resistant S. aureus. No methylene bridge should be present between the benzoxazole moiety and the p-substituted phenyl group. These observations led to the design of the next series of 5(or 6)-nitro/amino-2-(substituted phenyl/benzyl)benzoxazoles (Fig. 1C) [19]. Antibacterial evaluation indicated a broad spectrum of activity against the tested microorganisms (including S. aureus) with MIC values between 12.5 and >400 µg/mL.
However, the 2D-QSAR analysis using multivariable regression was performed only for Bacillus subtilis. Next, the series of 2-[4-(4-substituted benzamido/phenylacetamido)phenyl]benzothiazoles was prepared and evaluated; the MIC values ranged between 6.25 and 100 µg/mL (Fig. 1D) [20]. The majority of the compounds showed greater antibacterial activity against the screened drug-resistant clinical isolate of S. aureus than against the non-resistant S. aureus ATCC 25923. Further development based on the structural effects already identified led to the synthesis of a series of 2-[4-(4-substituted benzamido/phenylacetamido/phenylpropionamido)benzyl/phenyl]benzothiazoles (Fig. 1E) [21]. Evaluation of their activity against various bacterial pathogens showed MIC values between 6.25 and 200 µg/mL. Two compounds exhibited high antimicrobial activity against the drug-resistant clinical isolate of S. aureus, but the structural effects were not evaluated by QSAR. Finally, two groups of antibacterial compounds were designed and synthesized by combining two pharmacologically compatible moieties in one molecule, namely by attaching a sulfonamide group to a benzoxazole [22]. The first group comprised derivatives of 5-amino-2-(4-substituted phenyl/benzyl)benzoxazole (Fig. 1F). The derivatives of 2-substituted-5-(4-nitro/aminophenylsulfonamido)benzoxazole (Fig. 1G) form the second group. The minimum inhibitory concentrations of these derivatives are between 8 and 256 µg/mL. The structural effects on the MIC of these compounds have been evaluated only for their activity against Mycobacterium tuberculosis [23,24]. An indispensable step in the targeted search for suitable antibacterial compounds is the analysis of the relationship between the structure and the biological effect of a substance. The structural diversity of the compounds mentioned above does not allow the use of classical QSAR procedures to evaluate their antibacterial activity against S. aureus.
Therefore, in this work, we used hybrid optimal descriptors calculated with the molecular graph, i.e., based on the description of the entire structure of a molecule. A simplified molecular input-line entry system (SMILES) represents an appealing alternative to representing the molecular structure by a graph, and the development of SMILES-based QSAR has become a promising direction of research in QSAR theory and applications [25,26]. From the medicinal chemistry point of view, only one SMILES-based QSAR model describing the effect of structure on antibacterial activity against S. aureus has been published so far. In 2020, Lotfi et al. [27] studied the possibility of predicting the MIC of 204 ionic liquids against S. aureus and reported QSAR models of high statistical quality.
Consequently, this study aims to combine the results of previous work and evaluate the antibacterial effects of a total of 151 compounds against S. aureus using SMILES-based hybrid optimal descriptors.

Data
The structures of the examined compounds and their MIC against (i) S. aureus ATCC 25923, and (ii) the drug-resistant clinical isolate of S. aureus were taken from previous publications [16,18-22]. The molecular structure of the compounds was converted to SMILES notation using ACD/ChemSketch software [28]. Due to the structural diversity of the examined compounds, the MIC values were recalculated from µg/mL to mol/L units and expressed as logarithms of reciprocal values. This makes it possible to express the antibacterial activity of the test substances on a common basis, with respect to the number of molecules (and not the weight). The complete data are presented in Table 1.
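The unit conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the molar masses in the example are hypothetical, and a base-10 logarithm is assumed.

```python
import math

def pmic_from_ug_per_ml(mic_ug_per_ml: float, molar_mass_g_per_mol: float) -> float:
    """Return log10(1 / MIC) with MIC expressed in mol/L (assumed base-10 logarithm)."""
    # 1 ug/mL equals 1 mg/L, i.e. 1e-3 g/L.
    mic_g_per_l = mic_ug_per_ml * 1e-3
    # Dividing by the molar mass (g/mol) gives the molar concentration (mol/L).
    mic_mol_per_l = mic_g_per_l / molar_mass_g_per_mol
    return -math.log10(mic_mol_per_l)

# Hypothetical example: the same mass-based MIC for two compounds
# with different molar masses gives different molar activities.
light = pmic_from_ug_per_ml(100.0, 200.0)
heavy = pmic_from_ug_per_ml(100.0, 400.0)
print(round(light, 3), round(heavy, 3))
```

With equal MIC values in µg/mL, the heavier compound obtains the higher value because fewer molecules are needed to reach the same mass concentration.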

Optimal hybrid descriptor
The molecular structure can be represented by SMILES and/or a molecular graph (hydrogen suppressed graph). Fig. 2 contains an example of the molecular structure together with the SMILES and the hydrogen suppressed graph for compound 78.
The hybrid optimal descriptors [10] are sensitive to both above-mentioned representations of the molecular structure. Hybrid optimal descriptors are calculated by optimization of the so-called correlation weights of the SMILES attributes together with the correlation weights of the graph invariants. The optimal hybrid descriptor DCW(T,N) is applied for a predictive model of the endpoint via the equation:

$$\mathrm{Endpoint} = C_0 + C_1 \times DCW(T,N) \tag{1}$$

where

$$DCW(T,N) = \sum CW(S_k) + \sum CW(SS_k) + \sum CW(SSS_k) + \sum CW(EC1_k) + \sum CW(EC2_k) + \sum CW(EC3_k) \tag{2}$$

If SMILES = ABCD, the S, SS, and SSS can be represented as

S: A, B, C, D
SS: AB, BC, CD
SSS: ABC, BCD

The EC1, EC2, and EC3 are the Morgan extended connectivity of first, second, and third order, respectively. The graph invariants are calculated with the adjacency matrix (Table 2).

Table 3. The statistical characteristics of the models for Endpoint 1, calculated with Eqs. 13-15 for the (A) active training set, (P) passive training set, (C) calibration set, and (V) validation set (n is the number of compounds in the corresponding set).
T is an integer threshold that separates SMILES attributes into rare and non-rare. Only the non-rare SMILES attributes are used to build the model; the rare ones are not.
N is the number of epochs of the optimization of the correlation weights. S_k is a SMILES atom, i.e., one symbol of the SMILES line (e.g., '=', 'O') or a group of symbols that cannot be examined separately (e.g., 'Cu', '%11').

Table caption (partially recovered): ... 1 (split 1). Na, Np, and Nc are the frequencies of a molecular feature in the active training, passive training, and calibration sets, respectively. The equivalent promoters are indicated in bold.
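The extraction of SMILES atoms and of the S, SS, and SSS attributes, together with the removal of rare attributes, can be sketched as below. This is a simplified illustration and not the CORAL implementation. Assumptions: the tokenizer only groups bracket atoms, two-digit ring closures (%NN), and the symbols Cl and Br; an attribute is treated as non-rare when it occurs in at least T training SMILES; all function names are hypothetical.

```python
import re
from collections import Counter

# Simplified tokenizer (assumption): bracket atoms, %NN ring closures and
# the two-letter symbols Cl/Br stay together; any other character is one SMILES atom.
_TOKEN = re.compile(r"\[[^\]]+\]|%\d{2}|Cl|Br|.")

def smiles_atoms(smiles):
    """Split a SMILES line into SMILES atoms (S_k)."""
    return _TOKEN.findall(smiles)

def smiles_attributes(smiles):
    """Return the S, SS and SSS attributes as sliding windows of 1, 2 and 3 SMILES atoms."""
    s = smiles_atoms(smiles)
    ss = [tuple(s[i:i + 2]) for i in range(len(s) - 1)]
    sss = [tuple(s[i:i + 3]) for i in range(len(s) - 2)]
    return {"S": s, "SS": ss, "SSS": sss}

def non_rare_attributes(training_smiles, threshold):
    """Keep attributes present in at least `threshold` training SMILES (assumed meaning of T)."""
    counts = Counter()
    for smi in training_smiles:
        attrs = smiles_attributes(smi)
        # Count every attribute at most once per molecule.
        present = {(kind, a) for kind, values in attrs.items() for a in values}
        counts.update(present)
    return {attr for attr, n in counts.items() if n >= threshold}

example = smiles_attributes("ABCD")
kept = non_rare_attributes(["CCO", "CCN", "CC"], threshold=2)
```

Representing SS and SSS as tuples rather than concatenated strings keeps the boundaries between multi-character SMILES atoms unambiguous.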

SMILES attributes and graph invariants
The CW(S_k), CW(SS_k), and CW(SSS_k) are the correlation weights of the above SMILES attributes.
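The Morgan extended connectivity values EC1, EC2, and EC3 used as graph invariants can be computed from the adjacency matrix of the hydrogen suppressed graph. The sketch below assumes the usual recursive definition, in which the zero-order value of a vertex is its degree and each higher-order value is the sum of the previous-order values of its neighbours; the function name is hypothetical.

```python
def extended_connectivity(adjacency, order):
    """Return per-vertex EC values for orders 0..order as a list of lists."""
    n = len(adjacency)
    # EC0 of a vertex is its degree (row sum of the adjacency matrix).
    ec = [[sum(row) for row in adjacency]]
    for _ in range(order):
        prev = ec[-1]
        # The next order sums the previous-order values over the neighbours.
        ec.append([sum(prev[j] for j in range(n) if adjacency[i][j])
                   for i in range(n)])
    return ec

# Hydrogen suppressed graph of a three-atom chain (e.g. C-C-C).
chain = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
ec0, ec1, ec2, ec3 = extended_connectivity(chain, order=3)
```

Each distinct value of EC1, EC2, or EC3 then acts as a molecular feature that receives its own correlation weight.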

The Monte Carlo optimization
Eqn. 2 needs the numerical data on the above correlation weights. The Monte Carlo optimization is a tool to calculate those correlation weights. Here, two target functions for the Monte Carlo optimization are examined. In these functions, r_AT and r_PT are the correlation coefficients between the observed and predicted endpoints for the active training set and the passive training set, respectively.
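The IIC_C is the index of ideality of correlation [29,30], and it is calculated with data on the calibration set as follows:

$$IIC_C = r_C \times \frac{\min\left({}^{-}MAE_C,\ {}^{+}MAE_C\right)}{\max\left({}^{-}MAE_C,\ {}^{+}MAE_C\right)}$$

$${}^{-}MAE_C = \frac{1}{{}^{-}N}\sum_{\Delta_k < 0} \left|\Delta_k\right|, \qquad {}^{+}MAE_C = \frac{1}{{}^{+}N}\sum_{\Delta_k \geq 0} \left|\Delta_k\right|$$

$$\Delta_k = \mathrm{observed}_k - \mathrm{calculated}_k$$

Here r_C is the correlation coefficient for the calibration set, and ⁻N and ⁺N are the numbers of compounds with Δ_k < 0 and Δ_k ≥ 0, respectively. The observed and calculated are the corresponding values of the endpoint.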
The Monte Carlo optimization that used the IIC C is described in the literature [29,30].
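As an illustration of the index of ideality of correlation, the sketch below follows its commonly published definition: the correlation coefficient on the calibration set multiplied by the ratio of the smaller to the larger of the mean absolute errors computed separately for negative and non-negative residuals (observed minus calculated). The function names are hypothetical, and returning 0.0 when one group of residuals is empty is an assumption made only to keep the example defined.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def iic(observed, calculated):
    """Index of ideality of correlation for one set of compounds."""
    deltas = [o - c for o, c in zip(observed, calculated)]
    neg = [abs(d) for d in deltas if d < 0]
    pos = [abs(d) for d in deltas if d >= 0]
    if not neg or not pos:
        # Assumption: the index is set to zero when one error group is empty.
        return 0.0
    mae_neg = sum(neg) / len(neg)
    mae_pos = sum(pos) / len(pos)
    return pearson_r(observed, calculated) * min(mae_neg, mae_pos) / max(mae_neg, mae_pos)

obs = [1.0, 2.0, 3.0, 4.0]
balanced = [1.5, 1.5, 3.5, 3.5]    # negative and positive errors of equal size
unbalanced = [1.5, 1.9, 3.5, 3.9]  # negative errors are five times larger
```

For balanced errors the index equals the correlation coefficient; the more the negative and positive errors differ in magnitude, the more the index is reduced.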

Results and discussion
QSARs based on hybrid optimal descriptors were performed for 151 examined compounds (Table 1). Two endpoints were studied: (i) the first was the MIC against S. aureus ATCC 25923, and (ii) the second was the MIC against the drug-resistant clinical isolate of S. aureus.
The examined compounds were randomly split into an active training set (≈25%), passive training set (≈25%), calibration set (≈25%), and validation set (≈25%). Each of the above sets has a defined task. The active training set is used to build the model: molecular features extracted from quasi-SMILES of the active training set are involved in the Monte Carlo optimization, which provides correlation weights for these features that give the maximal correlation coefficient between the descriptor (the sum of the correlation weights) and the endpoint on the active training set. The task of the passive training set is to check whether the model obtained for the active training set is satisfactory for quasi-SMILES that were not involved in the active training set. The calibration set should detect the start of overtraining (overfitting). At the beginning of the optimization, the correlation coefficients between the experimental values of the endpoint and the descriptor increase simultaneously for all sets, but the correlation coefficient for the calibration set reaches a maximum (this is the start of overtraining), and further optimization leads to a decrease of the correlation coefficient for the calibration set. Optimization should be stopped when overtraining starts. After stopping the Monte Carlo optimization procedure, the validation set is used to assess the predictive potential of the obtained model.
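The splitting scheme and the stopping rule described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure implemented in CORAL: the set names, the seed, and the calibration correlation values are hypothetical, and the stopping epoch is simply taken as the epoch with the highest calibration correlation.

```python
import random

def split_four_ways(items, seed=0):
    """Randomly split items into four sets of approximately 25% each."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    names = ["active_training", "passive_training", "calibration", "validation"]
    # Taking every fourth item gives set sizes that differ by at most one.
    return {name: shuffled[i::4] for i, name in enumerate(names)}

def stopping_epoch(calibration_r):
    """Index of the epoch at which the calibration correlation is maximal."""
    return max(range(len(calibration_r)), key=calibration_r.__getitem__)

sets = split_four_ways(range(151), seed=1)
sizes = [len(sets[name]) for name in sets]

# Hypothetical calibration correlations recorded after each epoch.
epoch = stopping_epoch([0.20, 0.50, 0.70, 0.65, 0.60])
```

Repeating the split with different seeds corresponds to testing several random splits, which is how the reproducibility of the approach is examined.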

MIC against S. aureus ATCC 25923 (i.e., Endpoint 1)
To check the reproducibility of the CORAL [31] models, one should test several splits into the training sub-system (i.e., active training, passive training, and calibration sets) and the validation sub-system. The described scheme for three random splits gives the models denoted as Eqs. 13-15. Table 3 contains the statistical quality of these models. Table 4 (Refs. [32-37]) contains the statistical criteria of the predictive potential of a model.

MIC against drug-resistant clinical isolate of S. aureus (i.e., Endpoint 2)
For Endpoint 2, the described scheme for three random splits gives the models denoted as Eqs. 16-18. Table 5 contains the statistical quality of these models.

Mechanistic interpretation
An example of the technical details for Split 1, i.e., the calculated values for Endpoint 1 (Eqn. 13) and Endpoint 2 (Eqn. 16), and the corresponding correlation weights for the SMILES attributes and graph invariants, is presented in Supplementary Material.
Having numerical data on the correlation weights obtained in several runs of the described Monte Carlo optimization, one can find molecular features extracted from SMILES or hydrogen suppressed graphs which have solely positive correlation weights. These should be interpreted as promoters of increase for the corresponding endpoint. If a molecular feature has a stable negative correlation weight in several runs of the optimization, it should be interpreted as a promoter of decrease for an endpoint. Table 6 contains a collection of the above promoters for Endpoint 1, and Table 7 contains similar data for Endpoint 2. One can see (Tables 6 and 7) that Endpoint 1 and Endpoint 2 have five equivalent promoters (indicated in bold). In other words, these endpoints are far from identical.
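The selection of promoters from several optimization runs can be sketched as below. The feature names and correlation weights are invented for illustration only, and the function name is hypothetical; features that are missing from any run, or whose sign changes between runs, are left unclassified.

```python
def classify_promoters(runs):
    """Split features into promoters of increase and of decrease.

    `runs` is a list of dicts, one per optimization run, mapping a
    molecular feature to its correlation weight.
    """
    # Only features that received a weight in every run are considered.
    common = set(runs[0]).intersection(*runs[1:])
    increase = sorted(f for f in common if all(run[f] > 0 for run in runs))
    decrease = sorted(f for f in common if all(run[f] < 0 for run in runs))
    return increase, decrease

# Invented correlation weights from three runs.
runs = [
    {"N": 0.5, "O": -0.3, "C": 0.1},
    {"N": 1.2, "O": -0.8, "C": -0.2},
    {"N": 0.3, "O": -0.1, "C": 0.4},
]
increase, decrease = classify_promoters(runs)
```

In this invented example, "N" is a promoter of increase and "O" a promoter of decrease, while "C" is unstable and therefore not interpreted.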

The statistical quality of the models
The statistical quality of the models for Endpoint 1 and Endpoint 2 is quite good (Tables 3 and 5). Reproducibility of the results for both endpoints is observed. However, the predictive potential observed for the three random splits is not identical. For both endpoints, the best predictive potential is observed in the case of split 2. The statistical quality of the models for Endpoint 2 is slightly better than that of the models for Endpoint 1. The models suggested here are traditional; that is, multi-target approaches [6-8] and ADMET [9] are not used here. However, in principle, the approach could be applied to the corresponding analyses in the future.

Conclusions
The application of hybrid optimal descriptors has been proposed and tested to develop a predictive model for 151 structurally diverse compounds with antibacterial activity against S. aureus ATCC 25923 (Endpoint 1) or the drug-resistant clinical isolate of S. aureus (Endpoint 2). The predictive potential of these models has been checked with three random splits into the active training, passive training, calibration, and validation sets. The proposed approach gives satisfactory predictive models for both endpoints examined, but it has been found that splitting has an apparent influence on the statistical quality of these models, and the best predictive potential is observed in the case of split 2 for both endpoints. The statistical quality of the models is slightly better for the Endpoint 2 models. The results of the study show the possibility of using SMILES-based QSAR in the evaluation of the antibacterial activity of structurally diverse compounds.

Author contributions
KN and AT designed the study and participated in writing the manuscript. AT performed the study, software, and calculation. IY provided data and participated in writing the manuscript. KN handled the funding acquisition. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.