IMR Press / FBL / Volume 27 / Issue 6 / DOI: 10.31083/j.fbl2706188
Open Access Original Research
An Inverse QSAR Method Based on Linear Regression and Integer Programming
Show Less
1 Department of Applied Mathematics and Physics, Kyoto University, 606-8501 Kyoto, Japan
2 Graduate School of Advanced Integrated Studies in Human Survavibility (Shishu-Kan), Kyoto University, 606-8306 Kyoto, Japan
3 Bioinformatics Center, Institute for Chemical Research, Kyoto University, 611-0011 Uji, Japan
*Correspondence: zhujs@amp.i.kyoto-u.ac.jp (Jianshen Zhu)
These authors contributed equally.
Academic Editors: Agnieszka Kaczor and Graham Pawelec
Front. Biosci. (Landmark Ed) 2022 , 27(6), 188; https://doi.org/10.31083/j.fbl2706188
Submitted: 16 February 2022 | Revised: 28 March 2022 | Accepted: 7 April 2022 | Published: 10 June 2022
Copyright: © 2022 The Author(s). Published by IMR Press.
This is an open access article under the CC BY 4.0 license.
Abstract

Background: Drug design is one of the important applications of biological science. Extensive studies have been done on computer-aided drug design based on inverse quantitative structure activity relationship (inverse QSAR), which is to infer chemical compounds from given chemical activities and constraints. However, exact or optimal solutions are not guaranteed in most of the existing methods. Method: Recently a novel framework based on artificial neural networks (ANNs) and mixed integer linear programming (MILP) has been proposed for designing chemical structures. This framework consists of two phases: an ANN is used to construct a prediction function, and then an MILP formulated on the trained ANN and a graph search algorithm are used to infer desired chemical structures. In this paper, we use linear regression instead of ANNs to construct a prediction function. For this, we derive a novel MILP formulation that simulates the computation process of a prediction function by linear regression. Results: For the first phase, we performed computational experiments using 18 chemical properties, and the proposed method achieved good prediction accuracy for a relatively large number of properties, in comparison with ANNs in our previous work. For the second phase, we performed computational experiments on five chemical properties, and the method could infer chemical structures with around up to 50 non-hydrogen atoms. Conclusions: Combination of linear regression and integer programming is a potentially useful approach to computational molecular design.

Keywords
machine learning
linear regression
integer programming
chemoinformatics
materials informatics
QSAR/QSPR
molecular design
1. Introduction

Analysis of the activities and properties of chemical compounds is important not only for chemical science but also for biological science because chemical compounds play important roles in metabolic and many other pathways. Computational prediction of chemical activities from their structural data has been studied for several decades under the name of quantitative structure activity relationship (QSAR) [1]. In addition to traditional regression-based methods, various machine learning methods have been applied to QSAR [2, 3]. Recently, neural networks and deep-learning technologies have extensively been applied to QSAR [4].

Inference of chemical structures with desired chemical activities under some constraints is also important because of its potential applications to drug design, and the problem has been studied under the name of inverse quantitative structure activity relationship (inverse QSAR). Chemical compounds are commonly represented by undirected graphs called chemical graphs in which vertices and edges correspond to atoms and chemical bonds, respectively. Due to the difficulty of directly handling chemical graphs in both QSAR and inverse QSAR, chemical compounds are usually represented as vectors of integer or real numbers, which are called descriptors in chemoinformatics and correspond to feature vectors in machine learning. In inverse QSAR, one major approach is to first infer feature vectors from given chemical activities and constraints, and then reconstruct chemical structures from these feature vectors [5, 6, 7]. However, the reconstruction itself is not an easy task because the number of possible chemical graphs is huge. For example, the number of chemical graphs with up to 30 atoms (vertices) C, N, O, and S may exceed 1060 [8]. Indeed, the problem to infer a chemical graph from a given feature vector is known as a computationally difficult problem (precisely, NP-hard) except for some simple cases [9]. Most existing methods for inverse QSAR do not guarantee exact or optimal solutions due to these inherent difficulties.

Recently, artificial neural networks (ANNs), in particular, graph convolutional networks [10] are extensively used for inverse QSAR. For example, recurrent neural networks [11, 12], variational autoencoders [13], grammar variational autoencoders [14], invertible flow models [15, 16], and generative adversarial networks [17] have been applied. However, these methods do not yet guarantee exact or optimal solutions.

Akutsu and Nagamochi [18] proved that the computation process of a given ANN can be simulated as a mixed integer linear programming (MILP). Based on this result, a novel framework for inferring a set of chemical graphs has been developed [19, 20], which is illustrated in Fig. 1. This framework consists of two phases. In the first phase, it constructs a prediction function and in the second phase, it infers a chemical graph. There are three stages in the first phase of the framework. In Stage 1, a chemical property π and a class 𝒢 of graphs are selected, and a property function a is defined so that a() is the value of π for a compound 𝒢. Then we collect a data set Dπ of chemical graphs in 𝒢 such that a() is available for every Dπ. In Stage 2, a feature function f:𝒢K for a positive integer K is introduced. In Stage 3, a prediction function η is constructed with an ANN 𝒩 that, given a vector xK, returns a value y=η(x) so that η(f()) serves as a predicted value to a() of π for each Dπ. Given a target chemical value y*, the second phase consists the next two phases to infer chemical graphs * with η(f(*))=y*. A feature function f and a prediction function η are obtained in the first phase, and we call an additional constraint on the substructures of target chemical graphs a topological specification. In Stage 4, the following two MILP formulations are designed:

- MILP (x,y;𝒞1) with a set 𝒞1 of linear constraints on variables x and y (and some other auxiliary variables) simulates the process of computing y:=η(x) from a vector x; and

- MILP (g,x;𝒞2) with a set 𝒞2 of linear constraints on variable x and a variable vector g that represents a chemical graph (and some other auxiliary variables) simulates the process of computing x:=f() from a chemical graph and chooses a chemical graph that satisfies the given topological specification σ.

Fig. 1.

An illustration of a framework for inferring a set of chemical graphs *.

Given a target value y*, the combined MILP (g,x,y;𝒞1,𝒞2) is solved to find a feature vector x*K and a chemical graph that satisfies the specification σ such that f()=x* and η(x*)=y* (where if the MILP is infeasible then this suggests that such a desired chemical graph does not exist). In Stage 5, by using the inferred chemical graph , we generate other chemical graphs * such that η(f(*))=y*.

Stage 4 MILP formulations to infer chemical graphs with cycle index 0, 1 and 2 are proposed in [20, 21, 22, 23], respectively, but no sophisticated topological specification was available yet. A restricted class of acyclic graphs that is characterized by an integer ρ, called a “branch-parameter” is introduced by Azam et al. [21]. This restricted class still covers most of the acyclic chemical compounds in the database. Akutsu and Nagamochi [24] extended the idea to define a restricted class of cyclic graphs, called “ρ-lean cyclic graphs” and introduced a set of flexible rules for describing a topological specification. Tanaka et al. [25] used a decision tree instead of ANNs to construct a prediction function η in Stage 3 in the framework and an MILP (x,y;𝒞1) that simulates the computation process of a decision tree.

Recently Shi et al. [26] proposed a new model to deal with an arbitrary graph in the framework called a two-layered model to represent the feature of a chemical graph. Also, the set of rules for describing a topological specification in [27] was refined so that a prescribed structure can be included in both of the acyclic and cyclic parts of a chemical graph . In this model, a chemical graph with an integer ρ1, we consider two parts, namely, the exterior and the interior of the hydrogen-suppressed graph that is obtained by removing hydrogen from . The exterior consists of maximal acyclic induced subgraphs with height at most ρ in and the interior is the connected subgraph of obtained by excluding the exterior. Shi et al. [26] also defined a feature vector f() of a chemical graph as a combination of the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs among the set of chemical rooted trees Tu rooted at interior-vertices u. Recently, Tanaka et al. [25] extended the model in order to directly treat a chemical graph with hydrogens so that the feature of the exterior can be represented with more variety of chemical rooted trees.

The contribution of this paper is as follows. Firstly, we make a slight modification to a model of chemical graphs proposed by Tanaka et al. [25] so that we can treat a chemical element with multi-valence such as sulfur S and a chemical graph with cations and anions. Then, we consider the prediction function. One of the most important factors in the framework is the quality of a prediction function η constructed in Stage 3. Also, overfitting is pointed out as a major issue in ANN-based approaches for QSAR because many parameters need to be optimized for ANNs [4]. In this paper, to construct a prediction function in Stage 3, we use linear regression instead of ANNs or decision trees. A learning algorithm for an ANN may not find a set of weights and biases that minimizes the error function since the algorithm simply iterates modification of the current weights and biases until it terminates at a local minimum value, and linear regression is much simpler than ANNs and decision trees and thereby we regard the performance of a prediction function by linear regression as the basis for other more sophisticated machine learning methods. In this paper, we derive an MILP formulation (x,y;𝒞1) in Stage 4 to simulate the computation process of a prediction function by linear regression. For an MILP formulation (g,x;𝒞2) that represents a feature function f and a specification σ in Stage 4, we can use the same formulation proposed by Tanaka et al. [25] with a slight modification (the detail of the MILP (g,x;𝒞2) can be found in Supplementary Material). In Stage 5, we can also use the dynamic programming algorithm due to Tanaka et al. [25] with a slight modification to generate target chemical graphs * and the details are omitted in this paper.

We implemented the framework based on the refined two-layered model and a prediction function by linear regression. The results of our computational experiments reveal a set of chemical properties to which a prediction function constructed by linear regression on our feature function performs well, in comparison with ANNs in our previous work. We also observe that chemical graphs with up to 50 non-hydrogen atoms can be inferred by the proposed method.

The paper is organized as follows. Section 2 introduces some notions and terminologies on graphs, modeling of chemical compounds and our choice of descriptors. Section 3 describes our modification to the two-layered model. Section 4 reviews the idea of linear regression and formulates an MILP (x,y;𝒞1) that simulates the computing process of a prediction function constructed by linear regression. Section 5 reports the results of some computational experiments conducted for 18 chemical properties such as vapor density and optical rotation. Section 6 gives conclusions with future work. Some technical details are given in Supplementary Material.

2. Preliminary

In this section, we review some notions and terminologies related to graphs, modeling of chemical compounds introduced by Tanaka et al. [25] and our choice of descriptors.

Let , +, and + denote the sets of reals, non-negative reals, integers and non-negative integers, respectively. For two integers a and b, let [a,b] denote the set of integers i with aib.

Graph Given a graph G, let V(G) and E(G) denote the sets of vertices and edges, respectively. For a subset VV(G) (resp., EE(G)) of a graph G, let G-V (resp., G-E) denote the graph obtained from G by removing the vertices in V (resp., the edges in E), where we remove all edges incident to a vertex in V in G-V. An edge subset EE(G) in a connected graph G is called separating (resp., non-separating) if G-E becomes disconnected (resp., G-E remains connected). The rankr(G) of a connected graph G is defined to be the minimum |F| of an edge subset FE(G) such that G-F contains no cycle, where r(G)=|E(G)|-|V(G)|+1. Observe that r(G-E)=r(G)-|E| holds for any non-separating edge subset EE(G). An edge e=u1u2E(G) in a connected graph G is called a bridge if {e} is separating, i.e., G-e consists of two connected graphs Gi containing vertex ui, i=1,2. For a connected cyclic graph G, an edge e is called a core-edge if it is in a cycle of G or is a bridge e=u1u2 such that each of the connected graphs Gi, i=1,2, of G-e contains a cycle. A vertex incident to a core-edge is called a core-vertex of G. A path with two end-vertices u and v is called a u,v-path.

A vertex designated in a graph G is called a root. In this paper, we designate at most two vertices as roots, and denote by Rt(G) the set of roots of G. We call a graph Grooted (resp., bi-rooted) if |Rt(G)|=1 (resp., |Rt(G)|=2), where we call Gunrooted if Rt(G)=.

For a graph G possibly with roots, a leaf-vertex is defined to be a non-root vertex vV(G)Rt(G) with degree 1. Call the edge uv incident to a leaf vertex v a leaf-edge, and denote by Vleaf (G) and Eleaf (G) the sets of leaf-vertices and leaf-edges in G, respectively. For a graph or a rooted graph G, we define graphs Gi,i+ obtained from G by removing the set of leaf-vertices i times so that

(1) G 0 := G ; G i + 1 := G i - V leaf ( G i ) ,

where we call a vertex vVleaf (Gk) a leaf k-branch and we say that a vertex vVleaf (Gk) has height ht (v)=k in G. The height ht (T) of a rooted tree T is defined to be the maximum of ht(v) of a vertex vV(T). For an integer k0, we call a rooted tree T k-lean if T has at most one leaf k-branch. For an unrooted cyclic graph G, we regard that the set of non-core-edges in G induces a collection 𝒯 of trees each of which is rooted at a core-vertex, where we call G k-lean if each of the rooted trees in 𝒯 is k-lean.

Modeling of Chemical Compounds

We introduce a set of chemical elements such as H (hydrogen), C (carbon), O (oxygen), N (nitrogen) and so on to represent a chemical compound. To distinguish a chemical element a with multiple valences such as S (sulfur), we denote a chemical element a with a valence i by a(i), where we do not use such a suffix (i) for a chemical element a with a unique valence. Let Λ be a set of chemical elements a(i). For example, Λ={H,C,O,N,P,S(2),S(4),S(6)}. Let val :Λ[1,6] be a valence function. For example, val(H)=1, val(C)=4, val(O)=2, val(P)=5, val(S(2))=2, val(S(4))=4 and val(S(6))=6. For each chemical element aΛ, let mass(a) denote the mass of a.

To represent a chemical compound, we use a chemical graph introduced by Tanaka et al. [25], which is defined to be a tuple =(H,α,β) of a simple, connected undirected graph H and functions α:V(H)Λ and β:E(H)[1,3]. The set of atoms and the set of bonds in the compound are represented by the vertex set V(H) and the edge set E(H), respectively. The chemical element assigned to a vertex vV(H) is represented by α(v) and the bond-multiplicity between two adjacent vertices u,vV(H) is represented by β(e) of the edge e=uvE(H). We say that two tuples (Hi,αi,βi),i=1,2 are isomorphic if they admit an isomorphism ϕ, i.e., a bijection ϕ:V(H1)V(H2) such that uvE(H1),α1(u)=a,α1(v)=b,β1(uv)=mϕ(u)ϕ(v)E(H2),α2(ϕ(u))=a,α2(ϕ(v))=b,β2(ϕ(u)ϕ(v))=m. When Hi is rooted at a vertex ri,i=1,2, (Hi,αi,βi),i=1,2 are rooted-isomorphic (r-isomorphic) if they admit an isomorphism ϕ such that ϕ(r1)=r2.

For a notational convenience, we use a function β:V(H)[0,12] for a chemical graph =(H,α,β) such that β(u) means the sum of bond-multiplicities of edges incident to a vertex u; i.e.,

(2) β ( u ) u v E ( H ) β ( u v )

for each vertex uV(H). For each vertex uV(H), define the electron-degree eledeg(u) to be

(3) eledeg ( u ) β ( u ) - val ( α ( u ) ) .

For each vertex uV(H), let deg(u) denote the number of vertices adjacent to the vertex u in .

For a chemical graph =(H,α,β), let Va(), aΛ denote the set vertices vV(H) such that α(v)=a in and define the hydrogen-suppressed chemical graph to be the graph obtained from H by removing all the vertices vVH().

3. Two-layered Model

This section reviews the idea of the two-layered model introduced by Shi et al. [26], and describes our modifications to the model.

Let =(H,α,β) be a chemical graph and ρ1 be an integer, which is called a branch-parameter.

A two-layered model of introduced by Shi et al. [26] is a partition of the hydrogen-suppressed chemical graph into an “interior” and an “exterior” in the following way. We call a vertex vV() (resp., an edge eE()) of G an exterior-vertex (resp., exterior-edge) if ht(v)<ρ (resp., e is incident to an exterior-vertex) and denote the sets of exterior-vertices and exterior-edges by Vex() and Eex(), respectively and denote Vint()=V()Vex() and Eint()=E()Eex(), respectively. We call a vertex in Vint() (resp., an edge in Eint()) an interior-vertex (resp., interior-edge). The set Eex() of exterior-edges forms a collection of connected graphs each of which is regarded as a rooted tree T rooted at the vertex vV(T) with the maximum ht(v). Let 𝒯ex() denote the set of these chemical rooted trees in . The interiorint of is defined to be the subgraph (Vint(),Eint()) of .

Fig. 2 illustrates an example of a hydrogen-suppressed chemical graph . For a branch-parameter ρ=2, the interior of the chemical graph in Fig. 2 is obtained by removing the set of vertices with degree 1 ρ=2 times; i.e., first remove the set V1={w1,w2,,w14} of vertices of degree 1 in and then remove the set V2={w15,w16,,w19} of vertices of degree 1 in -V1, where the removed vertices become the exterior-vertices of .

Fig. 2.

An illustration of a hydrogen-suppressed chemical graph obtained from a chemical graph with r()=4 by removing all the hydrogens, where for ρ=2, Vex()={wii[1,19]} and Vint()={uii[1,28]}

For each interior-vertex uVint(), let Tu𝒯ex() denote the chemical tree rooted at u (where possibly Tu consists of vertex u) and define the ρ-fringe-tree[u] to be the chemical rooted tree obtained from Tu by putting back the hydrogens originally attached to Tu in . Let 𝒯() denote the set of ρ-fringe-trees [u],uVint(). Fig. 3 illustrates the set 𝒯()={[ui]i[1,28]} of the 2-fringe-trees of the example in Fig. 2.

Fig. 3.

The set [ui],i[1,28] of the example in Fig. 2, where the root of each tree is depicted with a gray circle and the hydrogens attached to non-root vertices are omitted in the figure.

Feature Function We extend the feature function of a chemical graph introduced by Tanaka et al. [25]. The feature of an interior-edge e=uvEint() such that α(u)=a, deg(u)=d, α(v)=b, deg(v)=d and β(e)=m is represented by a tuple (ad,bd,m), which is called the edge-configuration of the edge e, where we call the tuple (a,b,m) the adjacency-configuration of the edge e.

For an integer K, a feature vector f() of a chemical graph is defined by a feature functionf that consists of K descriptors. We call K the feature space.

Tanaka et al. [25] defined a feature vector f()K to be a combination of the frequency of edge-configurations of the interior-edges and the frequency of chemical rooted trees among the set of chemical rooted trees [u] over all interior-vertices u. In this paper, we introduce the rank and the adjacency-configuration of leaf-edges as new descriptors in a feature vector of a chemical graph. See Supplementary Material for a full description of descriptors used in Stage 2 of the framework.

Topological Specification

A topological specification is described as a set of the following rules proposed by Shi et al. [26] and modified by Tanaka et al. [25]:

(i) a seed graph GC as an abstract form of a target chemical graph ;

(ii) a set of chemical rooted trees as candidates for a tree [u] rooted at each exterior-vertex u in ; and

(iii) lower and upper bounds on the number of components in a target chemical graph such as chemical elements, double/triple bounds and the interior-vertices in .

Fig. 4a,b illustrate examples of a seed graph GC and a set of chemical rooted trees, respectively. Given a seed graph GC, the interior of a target chemical graph is constructed from GC by replacing some edges a=uv with paths Pa between the end-vertices u and v and by attaching new paths Qv to some vertices v.

Fig. 4.

(a) An illustration of a seed graph GC with r(GC)=5 where the vertices in VC are depicted with gray circles, the edges in E(2) are depicted with dotted lines, the edges in E(1) are depicted with dashed lines, the edges in E(0/1) are depicted with gray bold lines and the edges in E(=1) are depicted with black solid lines; (b) A set ={ψ1,ψ2,,ψ30}(Dπ) of 30 chemical rooted trees ψi,i[1,30], where the root of each tree is depicted with a gray circle, where the hydrogens attached to non-root vertices are omitted in the figure.

For example, a chemical graph in Fig. 2 is constructed from the seed graph GC in Fig. 4a as follows.

- First replace five edges a1=u1u2,a2=u1u3,a3=u4u7,a4=u10u11 and a5=u11u12 in GC with new paths Pa1=(u1,u13,u2), Pa2=(u1,u14,u3), Pa3=(u4,u15,u16,u7), Pa4=(u10,u17,u18,u19,u11) and Pa5=(u11,u20,u21,u22,u12), respectively to obtain a subgraph G1 of .

- Next attach to this graph G1 three new paths Qu5=(u5,u24), Qu18=(u18,u25,u26,u27) and Qu22=(u22,u28) to obtain the interior of in Fig. 2.

- Finally attach to the interior 28 trees selected from the set and assign chemical elements and bond-multiplicities in the interior to obtain a chemical graph in Fig. 2. In Fig, 3, ψ1 is selected for [ui], i{6,7,11}. Similarly ψ2 for [u9], ψ4 for [u1], ψ6 for [ui], i{3,4,5,10,19,22,25,26}, ψ8 for [u8], ψ11 for [ui], i{2,13,16,17,20}, ψ15 for [u12], ψ19 for [u15], ψ23 for [u21], ψ24 for [u24], ψ25 for [u27], ψ26 for [u23], ψ27 for [u14] and ψ30 for [u28].

Our definition of a topological specification is analogous with the one by Tanaka et al. [25] except for a necessary modification due to the introduction of multiple valences of chemical elements, cations and anions (see Supplementary Material for a full description of topological specification).

4. Linear Regressions

For an integer p1 and a vector xp, the j-th entry of x is denoted by x(j),j[1,p].

Let D be a data set of chemical graphs with an observed value a(), where we denote by ai=a(i) for an indexed graph i.

Let f be a feature function that maps a chemical graph to a vector f()K where we denote by xi=f(i) for an indexed graph i. For a prediction function η:K, define an error function

(4) Err ( η ; D ) i D ( a i - η ( f ( i ) ) ) 2 = i D ( a i - η ( x i ) ) 2 ,

and define the coefficient of determination R2(η,D) to be

(5) R 2 ( η , D ) 1 - Err ( η ; D ) i D ( a i - a ~ ) 2 for a ~ = 1 | D | D a ( ) .

For a feature space K, a hyperplane is denoted by a pair (w,b) of a vector wK and a real b. Given a hyperplane (w,b)Kw, a prediction function ηw,b:K is defined by setting

(6) η w , b ( x ) w x + b = j [ 1 , K ] w ( j ) x ( j ) + b .

We wish to find a hyperplane (w,b) that minimizes the error function Err(ηw,b;D). In many cases, a feature vector f contains descriptors that do not play an essential role in constructing a good prediction function. When we solve the minimization problem, the entries w(j) for some descriptors j[1,K] in the resulting hyperplane (w,b) become zero, which means that these descriptors were not necessarily important for finding a prediction function ηw,b. It is proposed that solving the minimization with an additional penalty term τ to the error function often results in more number of entries w(j)=0, reducing a set of descriptors necessary for defining a prediction function ηw,b. For an error function with such a penalty term, a Ridge function 12|D|Err(ηw,b;D)+λ[j[1,K]w(j)2+b2] [28] and a Lasso function 12|D|Err(ηw,b;D)+λ[j[1,K]|w(j)|+|b|] [29] are known, where λ is a given real number.

Given a prediction function ηw,b, we can simulate a process of computing the output ηw,b(x) for an input xK as an MILP (x,y;𝒞1) in the framework. By solving such an MILP for a specified target value y*, we can find a vector x*K such that ηw,b(x*)=y*. Instead of specifying a single target value y*, we use lower and upper bounds y¯*,y¯* on the value a() of a chemical graph to be inferred. We can control the range between y¯* and y¯* for searching a chemical graph by setting y¯* and y¯* to be close or different values. A desired MILP is formulated as follows.

(x,y;𝒞1): An MILP formulation for the inverse problem to prediction function.

constants:

- A hyperplane (w,b) with wK and b;

- Real values y¯*,y¯* such that y¯*<y¯*;

- A set I of indices j[1,K] such that the j-th descriptor dcpj() is always an integer;

- A set I+ of indices j[1,K] such that the j-th descriptor dcpj() is always non-negative;

- (j),u(j), j[1,K]: lower and upper bounds on the j-th descriptor;

variables:

- Non-negative integer variable x(j)+, jII+;

- Integer variable x(j),jII+;

- Non-negative real variable x(j)+,jI+I;

- Real variable x(j),j[1,K](II+);

constraints:

(7) ( j ) x ( j ) u ( j ) , j [ 1 , K ] ;
y ¯ * j [ 1 , K ] w ( j ) x ( j ) + b y ¯ *

objective function: none.

The number of variables and constraints in the above MILP formulation is O(K). It is not difficult to see that the above MILP is an NP-hard problem. The entire MILP for Stage 4 consists of the two MILPs (x,y;𝒞1) and (g,x;𝒞2) with no objective function. The latter represents the computation process of our feature function f and a given topological specification. See Supplementary Material for the details of MILP (g,x;𝒞2).

5. Results

We implemented our method of Stages 1 to 5 for inferring chemical graphs under a given topological specification and conducted experiments to evaluate the computational efficiency. We executed the experiments on a PC with Processor: Core i7-9700 (3.0 GHz; 4.7 GHz at the maximum) and Memory: 16 GB RAM DDR4.

Results on Phase 1. We have conducted experiments of linear regression for 37 chemical properties among which we report the following 18 properties to which the test coefficient of determination R2 attains at least 0.8: octanol/water partition coefficient (Kow), heat of combustion (Hc), vapor density (Vd), optical rotation (OptR), electron density on the most positive atom (EDPA), melting point (Mp), heat of atomization (Ha), heat of formation (Hf), internal energy at 0K (U0), energy of lowest unoccupied molecular orbital (Lumo), isotropic polarizability (Alpha), heat capacity at 298.15K (Cv), solubility (Sl), surface tension (SfT), viscosity (Vis), isobaric heat capacities in liquid phase (IhcLiq), isobaric heat capacities in solid phase (IhcSol) and lipophilicity (Lp).

We used data sets provided by HSDB from PubChem [30]for Kow, Hc, Vd and OptR, M. Jalali-Heravi and M. Fatemi [31] for EDPA, Roy and Saha [32] for Mp, Ha and Hf, MoleculeNet [33] for U0, Lumo, Alpha, Cv and Sl, Goussard et al. [34] for SfT and Vis, R. Naef [35] for IhcLiq and IhcSol, and Figshare [36] for Lp.

Properties U0, Lumo, Alpha and Cv share a common original data set D* with more than 130,000 compounds, and we used a set Dπ of 1,000 graphs randomly selected from D* as a common data set of these four properties π in this experiment.

Stages 1, 2 and 3 in Phase 1 are implemented as follows.

Stage 1. We set a graph class 𝒢 to be the set of all chemical graphs with any graph structure, and set a branch-parameter ρ to be 2.

For each of the properties, we first select a set Λ of chemical elements and then collect a data set Dπ on chemical graphs over the set Λ of chemical elements. During construction of the data set Dπ, chemical compounds that do not satisfy one of the following are eliminated: the graph is connected, the number of non-hydrogen neighbors of each atom is at most four, and the number of carbon atoms is at least four.

Table 1 shows the size and range of data sets that we prepared for each chemical property in Stage 1, where we denote the following:

- Λ: the set of elements used in the data set Dπ; Λ is one of the following 11 sets: Λ1={H,C,O}; Λ2={H,C,O,N}; Λ3={H,C,O,S(2)}; Λ4={H,C,O,Si}; Λ5={H,C,O,N,Cl,P(3),P(5)}; Λ6={H,C,O,N,S(2),F}; Λ7={H,C,O,N,S(2),S(6),Cl}; Λ8={H,C(2),C(3),C(4),O,N(2),N(3)}; Λ9={H,C,O,N,S(2),S(4),S(6),Cl}; Λ10={H,C(2),C(3),C(4),C(5),O,N(1),N(2),N(3),F}; and Λ11={H,C(2),C(3),C(4),O,N(2),N(3),S(2),S(4),S(6),Cl}, where 𝚎(i) for a chemical element 𝚎 and an integer i1 means that a chemical element 𝚎 with valence i.

- |Dπ|: the size of data set Dπ over Λ for the property π.

- n¯,n¯: the minimum and maximum values of the number n() of non-hydrogen atoms in compounds in Dπ.

- a¯,a¯: the minimum and maximum values of a() for π over compounds in Dπ.

- |Γ|: the number of different edge-configurations of interior-edges over the compounds in Dπ.

- ||: the number of non-isomorphic chemical rooted trees in the set of all 2-fringe-trees in the compounds in Dπ.

- K: the number of descriptors in a feature vector f().

Table 1.Results in Phase 1.
π Λ |Dπ| n¯,n¯ a¯,a¯ |Γ| || K λπ K test R2
Kow Λ2 684 4, 58 –7.5, 15.6 25 166 223 6.4E-5 80.3 0.953
Kow Λ9 899 4, 69 –7.5, 15.6 37 219 303 5.5E-5 112.1 0.927
Hc Λ2 255 4, 63 49.6, 35099.6 17 106 154 1.9E-4 19.2 0.946
Hc Λ7 282 4, 63 49.6, 35099.6 21 118 177 1.9E-4 20.5 0.951
Vd Λ2 474 4, 30 0.7, 20.6 21 160 214 1.0E-3 3.6 0.927
Vd Λ5 551 4, 30 0.7, 20.6 24 191 256 5.5E-4 8.0 0.942
OptR Λ2 147 5, 44 –117.0, 165.0 21 55 107 4.6E-4 39.2 0.823
OptR Λ6 157 5, 69 –117.0, 165.0 25 62 123 7.3E-4 41.7 0.825
EDPA Λ1 52 11, 16 0.80, 3.76 9 33 64 1.0E-4 10.9 0.999
Mp Λ2 467 4, 122 –185.33, 300.0 23 142 197 3.7E-5 82.5 0.817
Ha Λ3 115 4, 11 1100.6, 3009.6 8 83 115 3.7E-5 39.0 0.997
Hf Λ1 82 4, 16 30.2, 94.8 5 50 74 1.0E-4 34.0 0.987
U0 Λ10 977 4, 9 –570.6,  –272.8 59 190 297 1.0E-7 246.7 0.999
Lumo Λ10 977 4, 9 –0.11, 0.10 59 190 297 6.4E-5 133.9 0.841
Alpha Λ10 977 4, 9 50.9, 99.6 59 190 297 1.0E-5 125.5 0.961
Cv Λ10 977 4, 9 19.2, 44.0 59 190 297 1.0E-5 165.3 0.961
Sl Λ9 915 4, 55 –11.6, 1.11 42 207 300 7.3E-5 130.6 0.808
SfT Λ4 247 5, 33 12.3, 45.1 11 91 128 6.4E-4 20.9 0.804
Vis Λ4 282 5, 36 –0.64, 1.63 12 88 126 8.2E-4 16.3 0.893
IhcLiq Λ2 770 4, 78 106.3, 1956.1 23 200 256 1.9E-5 82.2 0.987
IhcLiq Λ7 865 4, 78 106.3, 1956.1 29 246 316 8.2E-6 139.1 0.986
IhcSol Λ8 581 5, 70 67.4, 1220.9 33 124 192 2.8E-5 75.9 0.985
IhcSol Λ11 668 5, 70 67.4, 1220.9 40 140 228 2.8E-5 86.7 0.982
Lp Λ2 615 6, 60 –3.62, 6.84 32 116 186 1.0E-4 98.5 0.856
Lp Λ9 936 6, 74 –3.62, 6.84 44 136 231 6.4E-5 130.4 0.840

Stage 2. The newly defined feature function in our chemical model without suppressing hydrogen in Section 3 is used. We standardize the range of each descriptor and the range {ta¯ta¯} of property values a(),Dπ.

Stage 3. For each chemical property π, we select a penalty value λπ in the Lasso function from 36 different values from 0 to 100 by conducting linear regression as a preliminary experiment.

We conducted an experiment in Stage 3 to evaluate the performance of the prediction function based on cross-validation. For a property π, an execution of a cross-validation consists of five trials of constructing a prediction function as follows. First partition the data set Dπ into five subsets D(k), k[1,5] randomly. For each k[1,5], the k-th trial constructs a prediction function η(k) by conducting a linear regression with the penalty term λπ using the set DπD(k) as a training data set. We used scikit-learn version 0.23.2 with Python 3.8.5 for executing linear regression with Lasso function. For each property, we executed ten cross-validations and we show the median of test R2(η(k),D(k)),k[1,5] over all ten cross-validations. Recall that a subset of descriptors is selected in linear regression with Lasso function and let K denote the average number of selected descriptors over all 50 trials. The running time per trial in a cross-validation was at most one second.

Table 1 shows the results on Stages 2 and 3, where we denote the following:

- λπ: the penalty value in the Lasso function selected for a property π, where aEb means a×10b.

- K: the average of the number of descriptors selected in the linear regression over all 50 trials in ten cross-validations.

- test R2: the median of test R2 over all 50 trials in ten cross-validations.

Recall that the adjacency-configuration for leaf-edges was introduced as a new descriptor in this paper. Without including this new descriptor, the test R2 for property Vis was 0.790, that for Lumo was 0.799 and that for Mp was 0.796, while the test R2 for each of the other properties in Table 1 was almost the same.

From Table 1, we observe that a relatively large number of properties admit a good prediction function based on linear regression. The number K of descriptors used in linear regression is considerably small for some properties. For example of property Vd, the four descriptors most frequently selected in the case of Λ={H,O,C,N} are the number of non-hydrogen atoms; the number of interior-vertices v with degCint(v)=1; the number of fringe-trees r-isomorphic to the chemical rooted tree ψ1 in Fig. 5; and the number of leaf-edges with adjacency-configuration (O,C,2). The eight descriptors most frequently selected in the case of Λ={H,O,C,N,Cl,P(3),P(5)} are the number of non-hydrogen atoms; the number of interior-vertices v with degCint(v)=1; the number of exterior-vertices v with α(v)=Cl; the number of interior-edges with edge-configuration γi,i=1,2, where γ1=(C2,C2,2) and γ2=(C3,C4,1); and the number of fringe-trees r-isomorphic to the chemical rooted tree ψi,i=1,2,3 in Fig. 5.

Fig. 5.

An illustration of chemical rooted trees ψ1, ψ1 and ψ3 that are selected in Lasso linear regression for constructing a prediction function to property Vd, where the root is depicted with a gray circle.

For the 18 properties listed in Table 1, we used ANN to construct prediction functions. For this purpose, we used our newly proposed feature vector and the experimental setup as explained in Tanaka et al. [25]. From these computation experiments, we observe that for the properties Hc, Vd, Ha, Hf, U0, Alpha and Cv, the test R2 scores of the prediction functions obtained by Lasso linear regression is at least 0.05 more than those obtained by ANN. For the properties OptR, Sl and SfT, the test R2 scores of the prediction functions obtained by ANN is at least 0.05 more than those obtained by Lasso linear regression. For the other properties, the test R2 scores obtained by Lasso linear regression and ANN are comparable.

Results on Phase 2. We used a set of seven instances Ia, Ibi,i[1,4], Ic and Id based on seed graphs prepared by Shi et al. [26] to execute Stages 4 and 5 in Phase 2. We here present their seed graphs GC (see Supplementary Material for the details of instances Ia, Ibi,i[1,4], Ic and Id). The seed graph GC of instance Ia is illustrated in Fig. 4a. The seed graph GC1 (resp., GCi,i=2,3,4) of instances Ib1 and Id (resp., Ibi,i=2,3,4) is illustrated in Fig. 6.

Fig. 6.

(i) Seed graph GC1 for Ib1 and Id; (ii) Seed graph GC2 for Ib2; (iii) Seed graph GC3 for Ib3; (iv) Seed graph GC4 for Ib4.

Instance Ic has been introduced by Shi et al. [26] in order to infer a chemical graph such that the core of is the same as the core of chemical graph A: CID 24822711 in Fig. 7a and the frequency of each edge-configuration in the non-core of is the same as that of chemical graph B: CID 59170444 illustrated in Fig. 7b. This means that the seed graph GC of Ic is the core of A which is indicated by a shaded area in Fig. 7a.

Instance Id has been introduced by Shi et al. [26] in order to infer a monocyclic chemical graph such that the frequency vector of edge-configurations in is a vector obtained by merging those of two chemical graphs A: CID 10076784 and B: CID 44340250 illustrated in Fig. 7c,d, respectively.

Fig. 7.

An illustration of chemical compounds for instances Ic and Id: (a) A: CID 24822711; (b) B: CID 59170444; (c) A: CID 10076784; (d) B: CID 44340250, where hydrogens are omitted.

Stage 4. We executed Stage 4 for five properties π{Hc, Vd, OptR, IhcLiq, Vis}.

For the MILP formulation (x,y;𝒞1) in Section 4, we use the prediction function ηw,b that attained the median test R2 in Table 1. We used CPLEX version 12.10 to solve an MILP in Stage 4. Tables 2,3,4,5,6 show the computational results of the experiment in Stage 4 for the five properties, where we denote the following:

- y¯*,y¯*: lower and upper bounds y¯*, y¯* on the value a() of a chemical graph to be inferred;

- #v (resp., #c): the number of variables (resp., constraints) in the MILP in Stage 4;

- I-time: the time (sec.) to solve the MILP in Stage 4;

- n: the number n() of non-hydrogen atoms in the chemical graph inferred in Stage 4; and

- nint: the number nint() of interior-vertices in the chemical graph inferred in Stage 4;

- η(f()): the predicted property value η(f()) of the chemical graph inferred in Stage 4.

Table 2.Results of Stages 4 and 5 for Hc using Lasso linear regression.
inst. y¯*,y¯* #v #c I-time n nint η(f()) D-time -LB #
Ia 5950, 6050 9902 9255 4.6 44 25 5977.9 0.068 1 1
Ib1 5950, 6050 9404 6776 1.7 36 10 6007.1 0.048 6 6
Ib2 5950, 6050 11729 9891 16.7 50 25 6043.7 38.7 2.4×105 100
Ib3 5950, 6050 11510 9894 16.3 47 25 6015.4 0.353 8724 100
Ib4 5950, 6050 11291 9897 9.0 49 26 5971.6 0.304 84 84
Ic 13700, 13800 6915 7278 0.7 50 33 13703.3 0.016 1 1
Id 13700, 13800 5535 6781 4.9 44 23 13704.7 0.564 4.3×105 100
Table 3.Results of Stages 4 and 5 for Vd using Lasso linear regression.
inst. y¯*,y¯* #v #c I-time n nint η(f()) D-time -LB #
Ia 16, 17 9481 9358 1.6 38 23 16.83 0.070 1 1
Ib1 16, 17 9928 6986 1.5 35 12 16.68 0.206 48 48
Ib2 21, 22 12373 10101 10.0 48 25 21.62 0.104 20 20
Ib3 21, 22 12159 10104 6.5 48 25 21.95 3.65 8.6×105 100
Ib4 21, 22 11945 10107 8.1 48 25 21.34 0.057 6 6
Ic 21, 22 7073 7438 0.7 50 34 21.89 0.016 1 1
Id 17, 18 5693 6942 2.1 41 23 17.94 0.161 216 100
Table 4.Results of Stages 4 and 5 for OptR using Lasso linear regression.
inst. y¯*,y¯* #v #c I-time n nint η(f()) D-time -LB #
Ia 70, 71 8962 9064 3.5 40 23 70.1 0.061 1 1
Ib1 70, 71 9432 6662 2.7 37 14 70.1 0.185 2622 100
Ib2 70, 71 11818 9773 10.0 50 25 70.8 0.041 4 4
Ib3 70, 71 11602 9776 10.2 50 25 70.2 0.241 60 60
Ib4 70, 71 11386 9779 24.7 49 25 70.9 6.39 4.6×105 100
Ic –112, –111 6807 7170 1.8 50 32 -111.9 0.016 1 1
Id 70, 71 5427 6673 6.1 42 23 70.2 0.127 78768 100
Table 5.Results of Stages 4 and 5 for IhcLiq using Lasso linear regression.
inst. y¯*,y¯* #v #c I-time n nint η(f()) D-time -LB #
Ia 1190, 1210 10180 9538 3.9 48 26 1208.5 0.071 2 2
Ib1 1190, 1210 10784 7191 2.4 35 14 1206.7 0.082 12 12
Ib2 1190, 1210 13482 10302 14.1 47 25 1206.7 0.11 12 12
Ib3 1190, 1210 13275 10301 9.0 49 27 1209.9 0.090 24 24
Ib4 1190, 1210 13128 10306 16.5 50 25 1208.4 0.424 2388 100
Ic 1190, 1210 7193 7560 0.8 50 33 1196.5 0.016 1 1
Id 1190, 1210 5813 7063 2.2 44 23 1198.8 5.63 5.2×105 100
Table 6.Results of Stages 4 and 5 for Vis using Lasso linear regression.
inst. y¯*,y¯* #v #c I-time n nint η(f()) D-time -LB #
Ia 1.25, 1.30 6847 8906 1.3 38 22 1.295 0.042 2 2
Ib1 1.25, 1.30 7270 6397 2.5 36 15 1.272 0.155 140 100
Ib2 1.85, 1.90 8984 9512 8.9 45 25 1.879 0.149 288 100
Ib3 1.85, 1.90 8741 9515 16.2 45 26 1.880 0.137 4928 100
Ib4 1.85, 1.90 8498 9518 8.1 45 25 1.851 0.13 660 100
Ic 2.75, 2.80 6813 7162 1.0 50 33 2.763 0.025 4 4
Id 1.85, 1.90 5433 6665 2.7 41 23 1.881 0.138 4608 100

From Tables 2,3,4,5,6 we observe that an instance with a large number of variables and constraints takes more running time than those with a smaller size in general. We solved all instances in this experiment with our MILP formulation in a few seconds to around 30 seconds.

Fig. 8a–e illustrate the chemical graphs inferred from Ic with (y¯*,y¯*)=(13700,13800) of Hc, Ib2 with (y¯*,y¯*)=(21,22) of Vd, Ib4 with (y¯*,y¯*)=(70,71) of OptR, Id with (y¯*,y¯*)=(1190,1210) of IhcLiq, and Ib3 with (y¯*,y¯*)=(1.85,1.90) of Vis, respectively.

Fig. 8.

(a) with η(f())=13703.3 inferred from Ic with (y¯*,y¯*)=(13700,13800) of Hc; (b) with η(f())=21.62 inferred from Ib2 with (y¯*,y¯*)=(21,22) of Vd; (c) with η(f())=70.9 inferred from Ib4 with (y¯*,y¯*)=(70,71) of OptR; (d) with η(f())=1198.8 inferred from Id with (y¯*,y¯*)=(1190,1210) of IhcLiq; (e) with η(f())=1.880 inferred from Ib3 with (y¯*,y¯*)=(1.85,1.90) of Vis; (f) inferred from Ib4 with lower and upper bounds on the predicted property value ηπ(f()) of property π{Kow, Lp, Sl} in Table 9.

Similarly, we executed Stage 4 for these seven instances Ia, Ibi,i[1,4], Ic and Id for five properties π{Hc, Vd, OptR, IhcLiq, Vis} by using the prediction functions obtained by ANN. We list the running time to solve MILP formulation for each of these instances in Tables 7,8. From the computation experiments, we observe that for many instances, the running time is significantly faster than that of Stage 4 based on ANN.

Table 7.Running time of Stage 4 for Hc, Vd and OptR using ANN.
Hc Vd OptR
inst. y¯*,y¯* I-time inst. y¯*,y¯* I-time inst. y¯*,y¯* I-time
Ia 13350, 13450 24.7 Ia 18, 19 18.1 Ia 62, 63 35.6
Ib1 9650, 9750 13.5 Ib1 13, 14 9.4 Ib1 109, 110 15.5
Ib2 16750, 16850 70.4 Ib2 15, 16 40.9 Ib2 23, 24 192.6
Ib3 12350, 12450 87.0 Ib3 20, 21 46.3 Ib3 -2, -1 936.4
Ib4 14250, 14350 70.9 Ib4 22, 23 27.1 Ib4 19, 20 63.9
Ic 10400, 10500 31.3 Ic 20, 21 20.5 Ic 86, 87 16.4
Id 12500, 12600 44.3 Id 18, 19 6.1 Id 30, 31 31.8
Table 8.Running time of Stage 4 for IhcLiq and Vis using ANN.
IhcLiq Vis
inst. y¯*,y¯* I-time inst. y¯*,y¯* I-time
Ia 980, 1000 56.6 Ia 1.85, 1.90 2.0
Ib1 1000, 1020 40.4 Ib1 1.95, 2.00 3.5
Ib2 1130, 1150 71.6 Ib2 1.85, 1.90 19.7
Ib3 1240, 1260 45.0 Ib3 2.35, 2.40 26.0
Ib4 1240, 1260 105.7 Ib4 2.50, 2.55 9.3
Ic 810, 830 9.7 Ic 3.90, 3.95 1.8
Id 1100, 1120 25.8 Id 3.30, 3.35 8.3

Inferring a chemical graph with target values in multiple properties Once we obtained prediction functions ηπ for several properties π, include MILP formulations for these functions ηπ into a single MILP (x,y;𝒞1) so as to infer a chemical graph that satisfies given target values y* for these properties at the same time. As an additional experiment in Stage 4, we inferred a chemical graph that has a desired predicted value each of three properties Kow, Lp and Sl, where we used the prediction function ηπ for each property π{Kow, Lp, Sl} constructed in Stage 3. Table 9 shows the result of Stage 4 for inferring a chemical graph from instances Ib2, Ib3 and Ib4 with Λ={H,C,N,O,S(2),S(6),Cl}, where we denote the following:

- π: one of the three properties Kow, Lp and Slused in the experiment;

- y¯π*,y¯π*: lower and upper bounds y¯π*,y¯π* on the predicted property value ηπ(f()) of property π{Kow, Lp, Sl} for a chemical graph to be inferred;

- #v (resp., #c): the number of variables (resp., constraints) in the MILP in Stage 4;

- I-time: the time (sec.) to solve the MILP in Stage 4;

- n: the number n() of non-hydrogen atoms in the chemical graph inferred in Stage 4;

- nint : the number nint () of interior-vertices in the chemical graph inferred in Stage 4; and

- ηπ(f()): the predicted property value ηπ(f()) of property π{Kow, Lp, Sl} for the chemical graph inferred in Stage 4.

Table 9.Results of Stage 4 for instances Ibi,i=2,3,4 with specified target values of three properties Kow, Lp and Sl using Lasso linear regression.
inst. π y¯π*,y¯π* #v #c I-time n nint ηπ(f())
Kow –7.50, –7.40 –7.41
Ib2 Lp –1.40, –1.30 14574 11604 62.7 50 30 –1.33
Sl –11.6, –11.5 –11.52
Kow –7.40, –7.30 –7.38
Ib3 Lp –2.90, –2.80 14370 11596 35.5 48 25 –2.81
Sl –11.6, –11.4 –11.52
Kow –7.50, –7.40 –7.48
Ib4 Lp –0.70, –0.60 14166 11588 71.7 49 26 –0.63
Sl –11.4, –11.2 –11.39

Fig. 8f illustrates the chemical graph inferred from Ib4 with (y¯π1*,y¯π1*)=(-7.50,-7.40), (y¯π2*,y¯π2*)=(-0.70,-0.60) and (y¯π3*,y¯π3*)=(-11.4,-11.2) for π1=Kow, π2=Lp and π3=Sl, respectively.

Stage 5. We executed Stage 5 to generate more target chemical graphs *, where a chemical graph * is called a chemical isomer of a target chemical graph of a topological specification σ if f(*)=f() and * also satisfies the same topological specification σ. We computed chemical isomers * of each target chemical graph inferred in Stage 4. We executed an algorithm to generate chemical isomers of up to 100 when the number of all chemical isomers exceeds 100. We can obtain such an algorithm from the dynamic programming proposed by Tanaka et al. [25] with a slight modification. The algorithm first decomposes into a set of acyclic chemical graphs, next replaces each acyclic chemical graph T with another acyclic chemical graph T that admits the same feature vector as that of T, and finally assembles the resulting acyclic chemical graphs into a chemical isomer * of . Also, a lower bound on the total number of all chemical isomers of can be computed by the algorithm without generating all of them.

Tables 2,3,4,5,6 show the computational results of the experiment in Stage 5 for the five properties, where we denote the following:

- D-time: the running time (sec.) to execute the dynamic programming algorithm in Stage 5 to compute a lower bound on the number of all chemical isomers * of and generate all (or up to 100) chemical isomers *;

- -LB: a lower bound on the number of all chemical isomers * of ; and

- #: the number of all (or up to 100) chemical isomers * of generated in Stage 5.

From Tables 2,3,4,5,6, we observe that for many cases the running time for generating up to 100 target chemical graphs in Stage 5 is less than 0.4 seconds. For some chemical graph , no chemical isomer was found by our algorithm. This is because each acyclic chemical graph in the decomposition of has no alternative acyclic chemical graph than the original one. On the other hand, some chemical graph such as the one in Id in Table 2 admits an extremely large number of chemical isomers *. Remember that we know such a lower bound -LB on the number of chemical isomers without generating all of them.

6. Conclusions

In this paper, we studied the problem of inferring chemical structures from desired chemical properties and constraints, based on the framework proposed and developed in [18, 19, 20]. In the previous applications of the framework of inferring chemical graphs, artificial neural network (ANN) and decision tree have been used for the machine learning of Stage 3. In this paper, we used linear regression in Stage 3 for the first time and derived an MILP formulation that simulates the computation process of linear regression. We also extended a way of specifying a target value y* in a property so that the predicted value η(f()) of a target chemical graph is required to belong to an interval between two specified values y¯* and y¯*. Furthermore, we modified a model of chemical compounds so that multi-valence chemical elements, cation and anion are treated, and introduced the rank and the adjacency-configuration of leaf-edges as new descriptors in a feature vector of a chemical graph.

We implemented the new system of the framework and conducted computational experiments for Stages 1 to 5. We found 18 properties for which linear regression delivers a relatively good prediction function by using our feature vector based on the two-layered model. We also observed that an MILP formulation for inferring a chemical graph in Stage 4 can be solved efficiently over different types of test instances with complicated topological specifications. The experimental result suggests that our method can infer chemical graphs with up to 50 non-hydrogen atoms. Therefore, combination of linear regression and integer programming is a potentially useful approach to computational molecular design.

It is an interesting future work to use other learning methods such as graph convolution networks, random forest and an ensemble method to construct a prediction function and derive the corresponding MILP formulations in Stages 3 and 4 in the framework.

Author Contributions

Conceptialization, HN and TA; methodology, HN; software, JZ, NAA and KH; validation, JZ, NAA and HN; formal analysis, HN; data resources, KH, LZ, HN and TA; writing—original draft preparation, HN; writing—review and editing, NAA and TA; project administration, HN; funding acquisition, TA. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript.

Ethics Approval and Consent to Participate

Not applicable.

Acknowledgment

Not applicable.

Funding

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Conflict of Interest

The authors declare no conflict of interest. TA is serving as the guest editor of this journal. We declare that TA had no involvement in the peer review of this article and has no access to information regarding its peer review. Full responsibility for the editorial process for this article was delegated to AK and GP.

References
[1]
Cherkasov A, Muratov EN, Fourches D, Varnek A, Baskin II, Cronin M, et al. QSAR modeling: where have you been? Where are you going to? Journal of Medicinal Chemistry. 2014; 57: 4977–5010.
[2]
Lo YC, Rensi SE, Torng W, Altman RB. Machine learning in chemoinformatics and drug discovery. Drug Discovery Today. 2018; 23: 1538–1546.
[3]
Tetko IV, Engkvist O. From Big Data to Artificial Intelligence: chemoinformatics meets new challenges. Journal of Cheminformatics. 2020; 12: 74.
[4]
Ghasemi F, Mehridehnavi A, Pérez-Garrido A, Pérez-Sánchez H. Neural network and deep-learning algorithms used in QSAR studies: merits and drawbacks. Drug Discovery Today. 2018; 23: 1784–1790.
[5]
Miyao T, Kaneko H, Funatsu K. Inverse QSPR/QSAR Analysis for Chemical Structure Generation (from y to x). Journal of Chemical Information and Modeling. 2016; 56: 286–99.
[6]
Ikebata H, Hongo K, Isomura T, Maezono R, Yoshida R. Bayesian molecular design with a chemical language model. The Journal of Computer-Aided Molecular Design. 2017; 31: 379–391.
[7]
Rupakheti C, Virshup A, Yang W, Beratan DN. Strategy to discover diverse optimal molecules in the small molecule universe. Journal of Chemical Information and Modeling. 2015; 55: 529-5-37.
[8]
Bohacek RS, McMartin C, Guida WC. The art and practice of structure-based drug design: a molecular modeling perspective. Medicinal Research Reviews. 1996; 16: 3–50.
[9]
Akutsu T, Fukagawa D, Jansson J, Sadakane K. Inferring a graph from path frequency. Discrete Applied Mathematics. 2012; 160: 1416–1428.
[10]
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv. 2016. (in press)
[11]
Segler MHS, Kogej T, Tyrchan C, Waller MP. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Central Science. 2018; 4: 120–131.
[12]
Yang X, Zhang J, Yoshizoe K, Terayama K, Tsuda K. ChemTS: an efficient python library for de novo molecular generation. Science and Technology of Advanced Materials. 2017; 18: 972–976.
[13]
Gómez-Bombarelli R, Wei JN, Duvenaud D, Hernández-Lobato JM, Sánchez-Lengeling B, Sheberla D, et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science. 2018; 4: 268–276.
[14]
Kusner MJ, Paige B, Hernández-Lobato JM. Grammar variational autoencoder. Proceedings of the 34th International Conference on Machine Learning. 2017; 70: 1945–1954.
[15]
Madhawa K, Ishiguro K, Nakago K, Abe M. GraphNVP: an invertible flow model for generating molecular graphs. arXiv. 2019. (in press)
[16]
Shi C, Xu M, Zhu Z, Zhang W, Zhang M, Tang J. GraphAF: a flow-based autoregressive model for molecular graph generation. arXiv. 2020. (in press)
[17]
De Cao N, Kipf T. MolGAN: An implicit generative model for small molecular graphs. arXiv. 2018. (in press)
[18]
Akutsu T, Nagamochi H. A mixed integer linear programming formulation to artificial neural networks. Proceedings of the 2019 2nd International Conference on Information Science and Systems. 2019; 215–220.
[19]
Azam NA, Chiewvanichakorn R, Zhang F, Shurbevski A, Nagamochi H, Akutsu T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies. 2020; 3: 101–108.
[20]
Zhang F, Zhu J, Chiewvanichakorn R, Shurbevski A, Nagamochi H, Akutsu T. ‘A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees’. The 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Kitakyushu, Japan. 2020.
[21]
Azam NA, Zhu J, Sun Y, Shi Y, Shurbevski A, Zhao L, et al. A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming. Algorithms for Molecular Biology. 2021; 16: 18.
[22]
Ito R, Azam NA, Wang C, Shurbevski A, Nagamochi H, Akutsu T. ‘A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming’. BIOCOMP2020. Las Vegas, Nevada, USA. 2020.
[23]
Zhu J, Wang C, Shurbevski A, Nagamochi H, Akutsu T. A novel method for inference of chemical compounds of cycle index two with desired properties based on artificial neural networks and integer programming. Algorithms. 2020; 13: 124.
[24]
Akutsu T, Nagamochi H. A novel method for inference of chemical compounds with prescribed topological substructures based on integer programming. arXiv. 2020. (in press)
[25]
Tanaka K, Zhu J, Azam NA, Haraguchi K, Zhao L, Nagamochi H, Akutsu T. ‘An inverse QSAR method based on decision tree and integer programming’. The 17th International Conference on Intelligent Computing. Shenzhen, China. 2021.
[26]
Shi Y, Zhu J, Azam NA, Haraguchi K, Zhao L, Nagamochi H, et al. An Inverse QSAR Method Based on a Two-Layered Model and Integer Programming. International Journal of Molecular Sciences. 2021; 22: 2847.
[27]
Zhu J, Azam NA, Zhang F, Shurbevski A, Haraguchi K, Zhao L, et al. A Novel Method for Inferring Chemical Compounds with Prescribed Topological Substructures Based on Integer Programming. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2021. (in press)
[28]
Hoerl A, Kennard R. Ridge regression. In Encyclopedia of Statistical Sciences (pp. 129–136). New York: Wiley. 1988.
[29]
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996; 58: 267–288.
[30]
Annotations from HSDB (on pubchem). Available at: https://pubchem.ncbi.nlm.nih.gov/ (Accessed: 16 February 2022).
[31]
Jalali-Heravi M, Fatemi MH. Artificial neural network modeling of Kovats retention indices for noncyclic and monocyclic terpenes. Journal of Chromatography A. 2001; 915: 177–183.
[32]
Roy K, Saha A. Comparative QSPR studies with molecular connectivity, molecular negentropy and TAU indices. Journal of Molecular Modeling. 2003; 9: 259–270.
[33]
MoleculeNet. Available at: https://moleculenet.org (Accessed: 16 February 2022).
[34]
Goussard V, François Duprat F, Ploix J-L, Dreyfus G, Nardello-Rataj V, Aubry J-M. A new machine-learning tool for fast estimation of liquid viscosity. application to cosmetic oils. Journal of Chemical Information and Modeling. 2020; 60: 2012–2023.
[35]
Naef R. Calculation of the isobaric heat capacities of the liquid and solid phase of organic compounds at and around 298.15 K based on their “true” molecular volume. Molecules. 2019; 24: 1626.
[36]
Wang JB, Cao DS, Zhu MF, Yun YH, Xiao N, Liang YZ. In silico evaluation of logD7.4 and comparison with other prediction methods. Journal of Chemometrics. 2015; 29: 389–398.
Publisher’s Note: IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Share
Back to top