IMR Press / FBL / Volume 27 / Issue 6 / DOI: 10.31083/j.fbl2706188
Open Access Original Research
An Inverse QSAR Method Based on Linear Regression and Integer Programming
1 Department of Applied Mathematics and Physics, Kyoto University, 606-8501 Kyoto, Japan
2 Graduate School of Advanced Integrated Studies in Human Survivability (Shishu-Kan), Kyoto University, 606-8306 Kyoto, Japan
3 Bioinformatics Center, Institute for Chemical Research, Kyoto University, 611-0011 Uji, Japan
*Correspondence: zhujs@amp.i.kyoto-u.ac.jp (Jianshen Zhu)
These authors contributed equally.
Academic Editors: Agnieszka Kaczor and Graham Pawelec
Front. Biosci. (Landmark Ed) 2022, 27(6), 188; https://doi.org/10.31083/j.fbl2706188
Submitted: 16 February 2022 | Revised: 28 March 2022 | Accepted: 7 April 2022 | Published: 10 June 2022
This is an open access article under the CC BY 4.0 license.
Abstract

Background: Drug design is one of the important applications of biological science. Extensive studies have been done on computer-aided drug design based on inverse quantitative structure activity relationship (inverse QSAR), which aims to infer chemical compounds from given chemical activities and constraints. However, exact or optimal solutions are not guaranteed in most of the existing methods. Method: Recently a novel framework based on artificial neural networks (ANNs) and mixed integer linear programming (MILP) has been proposed for designing chemical structures. This framework consists of two phases: an ANN is used to construct a prediction function, and then an MILP formulated on the trained ANN and a graph search algorithm are used to infer desired chemical structures. In this paper, we use linear regression instead of ANNs to construct a prediction function. For this, we derive a novel MILP formulation that simulates the computation process of a prediction function by linear regression. Results: For the first phase, we performed computational experiments using 18 chemical properties, and the proposed method achieved good prediction accuracy for a relatively large number of properties, in comparison with ANNs in our previous work. For the second phase, we performed computational experiments on five chemical properties, and the method could infer chemical structures with up to around 50 non-hydrogen atoms. Conclusions: The combination of linear regression and integer programming is a potentially useful approach to computational molecular design.

Keywords
machine learning
linear regression
integer programming
chemoinformatics
materials informatics
QSAR/QSPR
molecular design
1. Introduction

Analysis of the activities and properties of chemical compounds is important not only for chemical science but also for biological science because chemical compounds play important roles in metabolic and many other pathways. Computational prediction of chemical activities from structural data has been studied for several decades under the name of quantitative structure activity relationship (QSAR) [1]. In addition to traditional regression-based methods, various machine learning methods have been applied to QSAR [2, 3]. Recently, neural networks and deep-learning technologies have been extensively applied to QSAR [4].

Inference of chemical structures with desired chemical activities under given constraints is also important because of its potential applications to drug design, and the problem has been studied under the name of inverse quantitative structure activity relationship (inverse QSAR). Chemical compounds are commonly represented by undirected graphs called chemical graphs, in which vertices and edges correspond to atoms and chemical bonds, respectively. Due to the difficulty of directly handling chemical graphs in both QSAR and inverse QSAR, chemical compounds are usually represented as vectors of integers or real numbers, which are called descriptors in chemoinformatics and correspond to feature vectors in machine learning. In inverse QSAR, one major approach is to first infer feature vectors from given chemical activities and constraints, and then reconstruct chemical structures from these feature vectors [5, 6, 7]. However, the reconstruction itself is not an easy task because the number of possible chemical graphs is huge. For example, the number of chemical graphs with up to 30 atoms (vertices) C, N, O, and S may exceed $10^{60}$ [8]. Indeed, the problem of inferring a chemical graph from a given feature vector is known to be computationally difficult (precisely, NP-hard) except for some simple cases [9]. Most existing methods for inverse QSAR do not guarantee exact or optimal solutions due to these inherent difficulties.

Recently, artificial neural networks (ANNs), in particular graph convolutional networks [10], have been extensively used for inverse QSAR. For example, recurrent neural networks [11, 12], variational autoencoders [13], grammar variational autoencoders [14], invertible flow models [15, 16], and generative adversarial networks [17] have been applied. However, these methods do not yet guarantee exact or optimal solutions.

Akutsu and Nagamochi [18] proved that the computation process of a given ANN can be simulated as a mixed integer linear program (MILP). Based on this result, a novel framework for inferring a set of chemical graphs has been developed [19, 20], which is illustrated in Fig. 1. This framework consists of two phases: in the first phase it constructs a prediction function, and in the second phase it infers a chemical graph. There are three stages in the first phase of the framework. In Stage 1, a chemical property $\pi$ and a class $\mathcal{G}$ of graphs are selected, and a property function $a$ is defined so that $a(\mathbb{C})$ is the value of $\pi$ for a compound $\mathbb{C}\in\mathcal{G}$. Then we collect a data set $D_{\pi}$ of chemical graphs in $\mathcal{G}$ such that $a(\mathbb{C})$ is available for every $\mathbb{C}\in D_{\pi}$. In Stage 2, a feature function $f:\mathcal{G}\to\mathbb{R}^{K}$ for a positive integer $K$ is introduced. In Stage 3, a prediction function $\eta$ is constructed with an ANN $\mathcal{N}$ that, given a vector $x\in\mathbb{R}^{K}$, returns a value $y=\eta(x)\in\mathbb{R}$ so that $\eta(f(\mathbb{C}))$ serves as a predicted value for $a(\mathbb{C})$ of $\pi$ for each $\mathbb{C}\in D_{\pi}$. Given a target chemical value $y^{*}$, the second phase consists of the next two stages, Stages 4 and 5, to infer chemical graphs $\mathbb{C}^{*}$ with $\eta\left(f\left(\mathbb{C}^{*}\right)\right)=y^{*}$. A feature function $f$ and a prediction function $\eta$ are obtained in the first phase, and we call an additional constraint on the substructures of target chemical graphs a topological specification. In Stage 4, the following two MILP formulations are designed:

- MILP $\mathcal{M}(x,y;\mathcal{C}_{1})$ with a set $\mathcal{C}_{1}$ of linear constraints on variables $x$ and $y$ (and some other auxiliary variables) simulates the process of computing $y:=\eta(x)$ from a vector $x$; and

- MILP $\mathcal{M}(g,x;\mathcal{C}_{2})$ with a set $\mathcal{C}_{2}$ of linear constraints on variable $x$ and a variable vector $g$ that represents a chemical graph $\mathbb{C}$ (and some other auxiliary variables) simulates the process of computing $x:=f(\mathbb{C})$ from a chemical graph $\mathbb{C}$ and chooses a chemical graph $\mathbb{C}$ that satisfies the given topological specification $\sigma$.

Fig. 1.

An illustration of a framework for inferring a set of chemical graphs $\mathbb{C}^{*}$.

Given a target value $y^{*}\in\mathbb{R}$, the combined MILP $\mathcal{M}(g,x,y;\mathcal{C}_{1},\mathcal{C}_{2})$ is solved to find a feature vector $x^{*}\in\mathbb{R}^{K}$ and a chemical graph $\mathbb{C}^{\dagger}$ that satisfies the specification $\sigma$ such that $f(\mathbb{C}^{\dagger})=x^{*}$ and $\eta(x^{*})=y^{*}$ (where if the MILP is infeasible then this suggests that such a desired chemical graph does not exist). In Stage 5, by using the inferred chemical graph $\mathbb{C}^{\dagger}$, we generate other chemical graphs $\mathbb{C}^{*}$ such that $\eta(f(\mathbb{C}^{*}))=y^{*}$.

MILP formulations for Stage 4 to infer chemical graphs with cycle index 0, 1 and 2 were proposed in [20, 21, 22, 23], but no sophisticated topological specification was available yet. A restricted class of acyclic graphs, characterized by an integer ${\rho}$ called a “branch-parameter”, was introduced by Azam et al. [21]; this restricted class still covers most of the acyclic chemical compounds in the database. Akutsu and Nagamochi [24] extended the idea to define a restricted class of cyclic graphs, called “${\rho}$-lean cyclic graphs”, and introduced a set of flexible rules for describing a topological specification. Tanaka et al. [25] used a decision tree instead of an ANN to construct a prediction function $\eta$ in Stage 3 of the framework, together with an MILP $\mathcal{M}(x,y;\mathcal{C}_{1})$ that simulates the computation process of a decision tree.

Recently, Shi et al. [26] proposed a new model, called a two-layered model, to represent the features of an arbitrary chemical graph in the framework. The set of rules for describing a topological specification in [27] was also refined so that a prescribed structure can be included in both the acyclic and cyclic parts of a chemical graph $\mathbb{C}$. In this model, given a chemical graph $\mathbb{C}$ and an integer ${\rho}\geq 1$, we consider two parts, namely the exterior and the interior of the hydrogen-suppressed graph $\langle\mathbb{C}\rangle$ that is obtained by removing the hydrogens from $\mathbb{C}$. The exterior consists of maximal acyclic induced subgraphs with height at most ${\rho}$ in $\langle\mathbb{C}\rangle$, and the interior is the connected subgraph of $\langle\mathbb{C}\rangle$ obtained by excluding the exterior. Shi et al. [26] also defined a feature vector $f(\mathbb{C})$ of a chemical graph $\mathbb{C}$ as a combination of the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs among the set of chemical rooted trees $T_{u}$ rooted at interior-vertices $u$. Recently, Tanaka et al. [25] extended the model to directly treat a chemical graph with hydrogens, so that the feature of the exterior can be represented with a greater variety of chemical rooted trees.

The contribution of this paper is as follows. First, we make a slight modification to the model of chemical graphs proposed by Tanaka et al. [25] so that we can treat chemical elements with multiple valences, such as sulfur S, and chemical graphs with cations and anions. We then consider the prediction function. One of the most important factors in the framework is the quality of the prediction function $\eta$ constructed in Stage 3. Moreover, overfitting is known to be a major issue in ANN-based approaches to QSAR because many parameters need to be optimized for ANNs [4]. In this paper, to construct a prediction function in Stage 3, we use linear regression instead of ANNs or decision trees. A learning algorithm for an ANN may fail to find a set of weights and biases that minimizes the error function, since the algorithm simply iterates modification of the current weights and biases until it terminates at a local minimum. Linear regression is much simpler than ANNs and decision trees, and we therefore regard the performance of a prediction function constructed by linear regression as a baseline for more sophisticated machine learning methods. We derive an MILP formulation $\mathcal{M}(x,y;\mathcal{C}_{1})$ in Stage 4 that simulates the computation process of a prediction function constructed by linear regression. For an MILP formulation $\mathcal{M}(g,x;\mathcal{C}_{2})$ that represents a feature function $f$ and a specification $\sigma$ in Stage 4, we can use the formulation proposed by Tanaka et al. [25] with a slight modification (the details of the MILP $\mathcal{M}(g,x;\mathcal{C}_{2})$ can be found in Supplementary Material). In Stage 5, we can also use the dynamic programming algorithm due to Tanaka et al. [25], with a slight modification, to generate target chemical graphs $\mathbb{C}^{*}$; the details are omitted in this paper.

We implemented the framework based on the refined two-layered model and a prediction function by linear regression. The results of our computational experiments reveal a set of chemical properties to which a prediction function constructed by linear regression on our feature function performs well, in comparison with ANNs in our previous work. We also observe that chemical graphs with up to 50 non-hydrogen atoms can be inferred by the proposed method.

The paper is organized as follows. Section 2 introduces some notions and terminology on graphs, the modeling of chemical compounds, and our choice of descriptors. Section 3 describes our modification to the two-layered model. Section 4 reviews the idea of linear regression and formulates an MILP $\mathcal{M}(x,y;\mathcal{C}_{1})$ that simulates the computation process of a prediction function constructed by linear regression. Section 5 reports the results of computational experiments conducted for 18 chemical properties such as vapor density and optical rotation. Section 6 gives conclusions and future work. Some technical details are given in Supplementary Material.

2. Preliminaries

In this section, we review some notions and terminology related to graphs, the modeling of chemical compounds introduced by Tanaka et al. [25], and our choice of descriptors.

Let $\mathbb{R}$, $\mathbb{R}_{+}$, $\mathbb{Z}$ and $\mathbb{Z}_{+}$ denote the sets of reals, non-negative reals, integers and non-negative integers, respectively. For two integers $a$ and $b$, let $[a,b]$ denote the set of integers $i$ with $a\leq i\leq b$.

Graph

Given a graph $G$, let $V(G)$ and $E(G)$ denote the sets of vertices and edges, respectively. For a subset $V^{\prime}\subseteq V(G)$ (resp., $E^{\prime}\subseteq E(G)$) of a graph $G$, let $G-V^{\prime}$ (resp., $G-E^{\prime}$) denote the graph obtained from $G$ by removing the vertices in $V^{\prime}$ (resp., the edges in $E^{\prime}$), where we remove all edges incident to a vertex in $V^{\prime}$ in $G-V^{\prime}$. An edge subset $E^{\prime}\subseteq E(G)$ in a connected graph $G$ is called separating (resp., non-separating) if $G-E^{\prime}$ becomes disconnected (resp., $G-E^{\prime}$ remains connected). The rank $\mathrm{r}(G)$ of a connected graph $G$ is defined to be the minimum size $|F|$ of an edge subset $F\subseteq E(G)$ such that $G-F$ contains no cycle, where $\mathrm{r}(G)=|E(G)|-|V(G)|+1$. Observe that $\mathrm{r}(G-E^{\prime})=\mathrm{r}(G)-|E^{\prime}|$ holds for any non-separating edge subset $E^{\prime}\subseteq E(G)$. An edge $e=u_{1}u_{2}\in E(G)$ in a connected graph $G$ is called a bridge if $\{e\}$ is separating, i.e., $G-e$ consists of two connected graphs $G_{i}$ containing vertex $u_{i}$, $i=1,2$. For a connected cyclic graph $G$, an edge $e$ is called a core-edge if it is in a cycle of $G$ or is a bridge $e=u_{1}u_{2}$ such that each of the connected graphs $G_{i}$, $i=1,2$, of $G-e$ contains a cycle. A vertex incident to a core-edge is called a core-vertex of $G$. A path with two end-vertices $u$ and $v$ is called a $u,v$-path.
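The rank formula $\mathrm{r}(G)=|E(G)|-|V(G)|+1$ is direct to compute. A minimal sketch (my own illustration, not code from the paper), assuming the graph is given as a vertex count and an edge list:

```python
# Illustrative sketch: the rank of a connected graph is the number of
# edges that must be removed to make it acyclic, r(G) = |E| - |V| + 1.

def rank(num_vertices, edges):
    """Rank r(G) of a connected graph given as an edge list."""
    return len(edges) - num_vertices + 1

# A 4-cycle with one pendant vertex: 5 vertices, 5 edges, so r(G) = 1.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4)]
print(rank(5, edges))  # 1
```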

A vertex designated in a graph $G$ is called a root. In this paper, we designate at most two vertices as roots, and denote by $\mathrm{Rt}(G)$ the set of roots of $G$. We call a graph $G$ rooted (resp., bi-rooted) if $|\mathrm{Rt}(G)|=1$ (resp., $|\mathrm{Rt}(G)|=2$), and we call $G$ unrooted if $\mathrm{Rt}(G)=\emptyset$.

For a graph $G$ possibly with roots, a leaf-vertex is defined to be a non-root vertex $v\in V(G)\setminus\mathrm{Rt}(G)$ with degree 1. Call the edge $uv$ incident to a leaf-vertex $v$ a leaf-edge, and denote by $V_{\text{leaf}}(G)$ and $E_{\text{leaf}}(G)$ the sets of leaf-vertices and leaf-edges in $G$, respectively. For a graph or a rooted graph $G$, we define graphs $G_{i}$, $i\in\mathbb{Z}_{+}$, obtained from $G$ by removing the set of leaf-vertices $i$ times so that

(1)$G_{0}:=G;\quad G_{i+1}:=G_{i}-V_{\text{leaf}}\left(G_{i}\right),$

where we call a vertex $v\in V_{\text{leaf}}\left(G_{k}\right)$ a leaf $k$-branch and we say that a vertex $v\in V_{\text{leaf}}\left(G_{k}\right)$ has height $\operatorname{ht}(v)=k$ in $G$. The height $\operatorname{ht}(T)$ of a rooted tree $T$ is defined to be the maximum of $\operatorname{ht}(v)$ over all vertices $v\in V(T)$. For an integer $k\geq 0$, we call a rooted tree $T$ $k$-lean if $T$ has at most one leaf $k$-branch. For an unrooted cyclic graph $G$, we regard the set of non-core-edges in $G$ as inducing a collection $\mathcal{T}$ of trees, each rooted at a core-vertex, and we call $G$ $k$-lean if each of the rooted trees in $\mathcal{T}$ is $k$-lean.
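The iterative leaf removal of Eq. (1) can be sketched as follows (the adjacency-dict layout and function name are my own assumptions, not code from the paper): a vertex removed in round $k$ has height $\operatorname{ht}(v)=k$, and the root, which is never a leaf-vertex, survives all rounds.

```python
# Sketch of Eq. (1): G_0 := G, G_{i+1} := G_i - V_leaf(G_i).
# A non-root vertex removed in round k has height ht(v) = k.

def heights(adj, root):
    """Heights ht(v) in a tree rooted at `root`.
    adj: dict mapping each vertex to the set of its neighbours."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    ht, k = {}, 0
    while len(adj) > 1:
        # All non-root degree-1 vertices are removed simultaneously.
        leaves = [v for v, nb in adj.items() if v != root and len(nb) == 1]
        for v in leaves:
            ht[v] = k
            for u in adj[v]:
                adj[u].discard(v)
            del adj[v]
        k += 1
    ht[root] = k  # the root is left after all rounds
    return ht

# A path r-a-b rooted at r: b has height 0, a has height 1, r has 2.
print(heights({"r": {"a"}, "a": {"r", "b"}, "b": {"a"}}, "r"))
# {'b': 0, 'a': 1, 'r': 2}
```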

Modeling of Chemical Compounds

We introduce a set of chemical elements such as H (hydrogen), C (carbon), O (oxygen), N (nitrogen) and so on to represent a chemical compound. To distinguish a chemical element $\mathrm{a}$ with multiple valences such as S (sulfur), we denote a chemical element $\mathrm{a}$ with a valence $i$ by $\mathrm{a}_{(i)}$, where we do not use such a suffix $(i)$ for a chemical element $\mathrm{a}$ with a unique valence. Let $\Lambda$ be a set of chemical elements $\mathrm{a}_{(i)}$. For example, $\Lambda=\{\mathrm{H},\mathrm{C},\mathrm{O},\mathrm{N},\mathrm{P},\mathrm{S}_{(2)},\mathrm{S}_{(4)},\mathrm{S}_{(6)}\}$. Let $\operatorname{val}:\Lambda\rightarrow[1,6]$ be a valence function. For example, $\operatorname{val}(\mathrm{H})=1$, $\operatorname{val}(\mathrm{C})=4$, $\operatorname{val}(\mathrm{O})=2$, $\operatorname{val}(\mathrm{P})=5$, $\operatorname{val}(\mathrm{S}_{(2)})=2$, $\operatorname{val}(\mathrm{S}_{(4)})=4$ and $\operatorname{val}(\mathrm{S}_{(6)})=6$. For each chemical element $\mathrm{a}\in\Lambda$, let $\operatorname{mass}(\mathrm{a})$ denote the mass of $\mathrm{a}$.

To represent a chemical compound, we use a chemical graph introduced by Tanaka et al. [25], which is defined to be a tuple $\mathbb{C}=(H,\alpha,\beta)$ of a simple, connected undirected graph $H$ and functions $\alpha:V(H)\to\Lambda$ and $\beta:E(H)\to[1,3]$. The set of atoms and the set of bonds in the compound are represented by the vertex set $V(H)$ and the edge set $E(H)$, respectively. The chemical element assigned to a vertex $v\in V(H)$ is represented by $\alpha(v)$ and the bond-multiplicity between two adjacent vertices $u,v\in V(H)$ is represented by $\beta(e)$ of the edge $e=uv\in E(H)$. We say that two tuples $(H_{i},\alpha_{i},\beta_{i})$, $i=1,2$, are isomorphic if they admit an isomorphism $\phi$, i.e., a bijection $\phi:V(H_{1})\to V(H_{2})$ such that $uv\in E(H_{1})$, $\alpha_{1}(u)=\mathrm{a}$, $\alpha_{1}(v)=\mathrm{b}$, $\beta_{1}(uv)=m$ $\leftrightarrow$ $\phi(u)\phi(v)\in E(H_{2})$, $\alpha_{2}(\phi(u))=\mathrm{a}$, $\alpha_{2}(\phi(v))=\mathrm{b}$, $\beta_{2}(\phi(u)\phi(v))=m$. When $H_{i}$ is rooted at a vertex $r_{i}$, $i=1,2$, the tuples $(H_{i},\alpha_{i},\beta_{i})$, $i=1,2$, are rooted-isomorphic (r-isomorphic) if they admit an isomorphism $\phi$ such that $\phi(r_{1})=r_{2}$.

For notational convenience, we use a function $\beta_{\mathbb{C}}:V(H)\to[0,12]$ for a chemical graph $\mathbb{C}=(H,\alpha,\beta)$ such that $\beta_{\mathbb{C}}(u)$ denotes the sum of the bond-multiplicities of the edges incident to a vertex $u$; i.e.,

(2)$\beta_{\mathbb{C}}(u)\triangleq\sum_{uv\in E(H)}\beta(uv)$

for each vertex $u\in V(H)$. For each vertex $u\in V(H)$, define the electron-degree $\operatorname{eledeg}_{\mathbb{C}}(u)$ to be

(3)$\operatorname{eledeg}_{\mathbb{C}}(u)\triangleq\beta_{\mathbb{C}}(u)-\operatorname{val}(\alpha(u)).$

For each vertex $u\in V(H)$, let $\deg_{\mathbb{C}}(u)$ denote the number of vertices adjacent to the vertex $u$ in $\mathbb{C}$.

For a chemical graph $\mathbb{C}=(H,\alpha,\beta)$, let $V_{\mathrm{a}}(\mathbb{C})$, $\mathrm{a}\in\Lambda$, denote the set of vertices $v\in V(H)$ such that $\alpha(v)=\mathrm{a}$ in $\mathbb{C}$, and define the hydrogen-suppressed chemical graph $\langle\mathbb{C}\rangle$ to be the graph obtained from $H$ by removing all the vertices $v\in V_{\mathrm{H}}(\mathbb{C})$.
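Eqs. (2) and (3) translate directly into code. A minimal sketch (the dict-based data layout is my own assumption, not the authors' implementation), using formaldehyde H$_2$C=O as an example:

```python
# Sketch of Eqs. (2)-(3): beta_C(u) is the sum of bond multiplicities
# at u, and eledeg_C(u) = beta_C(u) - val(alpha(u)).

VAL = {"H": 1, "C": 4, "O": 2, "N": 3}  # valence function val(a)

def beta_C(u, beta):
    """Sum of bond-multiplicities of edges incident to vertex u."""
    return sum(m for (a, b), m in beta.items() if u in (a, b))

def eledeg(u, alpha, beta):
    """Electron-degree of vertex u."""
    return beta_C(u, beta) - VAL[alpha[u]]

# Formaldehyde H2C=O: C = 0, O = 1, two hydrogens 2 and 3.
alpha = {0: "C", 1: "O", 2: "H", 3: "H"}
beta = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
print(beta_C(0, beta))         # 4
print(eledeg(0, alpha, beta))  # 0

# Hydrogen-suppressed graph <C>: drop every vertex with alpha(v) == "H".
print([v for v, a in alpha.items() if a != "H"])  # [0, 1]
```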

3. Two-layered Model

This section reviews the idea of the two-layered model introduced by Shi et al. [26], and describes our modifications to the model.

Let $\mathbb{C}=(H,\alpha,\beta)$ be a chemical graph and ${\rho}\geq 1$ be an integer, which is called a branch-parameter.

A two-layered model of $\mathbb{C}$ introduced by Shi et al. [26] is a partition of the hydrogen-suppressed chemical graph $\langle\mathbb{C}\rangle$ into an “interior” and an “exterior” in the following way. We call a vertex $v\in V(\langle\mathbb{C}\rangle)$ (resp., an edge $e\in E(\langle\mathbb{C}\rangle)$) of $\langle\mathbb{C}\rangle$ an exterior-vertex (resp., exterior-edge) if $\operatorname{ht}(v)<{\rho}$ (resp., $e$ is incident to an exterior-vertex). We denote the sets of exterior-vertices and exterior-edges by $V^{\operatorname{ex}}(\mathbb{C})$ and $E^{\operatorname{ex}}(\mathbb{C})$, respectively, and let $V^{\operatorname{int}}(\mathbb{C})=V(\langle\mathbb{C}\rangle)\setminus V^{\operatorname{ex}}(\mathbb{C})$ and $E^{\operatorname{int}}(\mathbb{C})=E(\langle\mathbb{C}\rangle)\setminus E^{\operatorname{ex}}(\mathbb{C})$. We call a vertex in $V^{\operatorname{int}}(\mathbb{C})$ (resp., an edge in $E^{\operatorname{int}}(\mathbb{C})$) an interior-vertex (resp., interior-edge). The set $E^{\operatorname{ex}}(\mathbb{C})$ of exterior-edges forms a collection of connected graphs, each of which is regarded as a rooted tree $T$ rooted at the vertex $v\in V(T)$ with the maximum $\operatorname{ht}(v)$. Let $\mathcal{T}^{\operatorname{ex}}(\langle\mathbb{C}\rangle)$ denote the set of these chemical rooted trees in $\langle\mathbb{C}\rangle$. The interior $\mathbb{C}^{\operatorname{int}}$ of $\mathbb{C}$ is defined to be the subgraph $(V^{\operatorname{int}}(\mathbb{C}),E^{\operatorname{int}}(\mathbb{C}))$ of $\langle\mathbb{C}\rangle$.

Fig. 2 illustrates an example of a hydrogen-suppressed chemical graph $\langle\mathbb{C}\rangle$. For a branch-parameter ${\rho}=2$, the interior of the chemical graph $\langle\mathbb{C}\rangle$ in Fig. 2 is obtained by removing the vertices of degree 1 ${\rho}=2$ times; i.e., first remove the set $V_{1}=\{w_{1},w_{2},\ldots,w_{14}\}$ of vertices of degree 1 in $\langle\mathbb{C}\rangle$, and then remove the set $V_{2}=\{w_{15},w_{16},\ldots,w_{19}\}$ of vertices of degree 1 in $\langle\mathbb{C}\rangle-V_{1}$; the removed vertices become the exterior-vertices of $\langle\mathbb{C}\rangle$.
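The interior/exterior split described above can be sketched as follows (a small illustration of my own, assuming an adjacency-dict representation; not the authors' code): the exterior-vertices are exactly those removed within $\rho$ rounds of deleting all degree-1 vertices.

```python
# Sketch: partition a hydrogen-suppressed graph into interior and
# exterior for a branch-parameter rho by rho rounds of degree-1
# vertex removal.

def split_interior_exterior(adj, rho):
    """adj: dict vertex -> set of neighbours. Returns (interior, exterior)."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    exterior = set()
    for _ in range(rho):
        leaves = [v for v, nb in adj.items() if len(nb) == 1]
        for v in leaves:
            exterior.add(v)
            for u in adj[v]:
                adj[u].discard(v)
            del adj[v]
    return set(adj), exterior

# A triangle u0-u1-u2 with a pendant path u2-w3-w4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
interior, exterior = split_interior_exterior(adj, rho=2)
print(sorted(interior))  # [0, 1, 2]
print(sorted(exterior))  # [3, 4]
```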

Fig. 2.

An illustration of a hydrogen-suppressed chemical graph $\langle\mathbb{C}\rangle$ obtained from a chemical graph $\mathbb{C}$ with $\mathrm{r}(\mathbb{C})=4$ by removing all the hydrogens, where for ${\rho}=2$, $V^{\operatorname{ex}}(\mathbb{C})=\left\{w_{i}\mid i\in[1,19]\right\}$ and $V^{\operatorname{int}}(\mathbb{C})=\left\{u_{i}\mid i\in[1,28]\right\}$.

For each interior-vertex $u\in V^{\operatorname{int}}(\mathbb{C})$, let $T_{u}\in\mathcal{T}^{\operatorname{ex}}(\langle\mathbb{C}\rangle)$ denote the chemical tree rooted at $u$ (where possibly $T_{u}$ consists of vertex $u$) and define the $\rho$-fringe-tree $\mathbb{C}[u]$ to be the chemical rooted tree obtained from $T_{u}$ by putting back the hydrogens originally attached to $T_{u}$ in $\mathbb{C}$. Let $\mathcal{T}(\mathbb{C})$ denote the set of $\rho$-fringe-trees $\mathbb{C}[u]$, $u\in V^{\operatorname{int}}(\mathbb{C})$. Fig. 3 illustrates the set $\mathcal{T}(\mathbb{C})=\left\{\mathbb{C}\left[u_{i}\right]\mid i\in[1,28]\right\}$ of the 2-fringe-trees of the example $\mathbb{C}$ in Fig. 2.

Fig. 3.

The set $\mathbb{C}\left[u_{i}\right],i\in[1,28]$ of the example $\mathbb{C}$ in Fig. 2, where the root of each tree is depicted with a gray circle and the hydrogens attached to non-root vertices are omitted in the figure.

Feature Function

We extend the feature function of a chemical graph $\mathbb{C}$ introduced by Tanaka et al. [25]. The feature of an interior-edge $e=uv\in E^{\operatorname{int}}(\mathbb{C})$ such that $\alpha(u)=\mathrm{a}$, $\deg_{\langle\mathbb{C}\rangle}(u)=d$, $\alpha(v)=\mathrm{b}$, $\deg_{\langle\mathbb{C}\rangle}(v)=d^{\prime}$ and $\beta(e)=m$ is represented by a tuple $(\mathrm{a}d,\mathrm{b}d^{\prime},m)$, which is called the edge-configuration of the edge $e$, where we call the tuple $(\mathrm{a},\mathrm{b},m)$ the adjacency-configuration of the edge $e$.

For an integer $K$, a feature vector $f(\mathbb{C})$ of a chemical graph $\mathbb{C}$ is defined by a feature function $f$ that consists of $K$ descriptors. We call $\mathbb{R}^{K}$ the feature space.

Tanaka et al. [25] defined a feature vector $f(\mathbb{C})\in\mathbb{R}^{K}$ to be a combination of the frequency of edge-configurations of the interior-edges and the frequency of chemical rooted trees among the set of chemical rooted trees $\mathbb{C}[u]$ over all interior-vertices $u$. In this paper, we introduce the rank and the adjacency-configuration of leaf-edges as new descriptors in a feature vector of a chemical graph. See Supplementary Material for a full description of descriptors used in Stage 2 of the framework.
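As an illustration of how edge-configuration frequencies could be tallied into descriptors, here is a small sketch (the dict-based representation and the canonical ordering of the two endpoints are my own assumptions, not the authors' implementation):

```python
# Sketch: count edge-configurations (a d, b d', m) over interior edges,
# where d is the vertex degree in the hydrogen-suppressed graph.

from collections import Counter

def edge_configurations(interior_edges, alpha, deg, beta):
    cnt = Counter()
    for (u, v) in interior_edges:
        # Order the endpoints canonically so (a d, b d', m) and
        # (b d', a d, m) are counted as the same configuration.
        a, b = (u, v) if (alpha[u], deg[u]) <= (alpha[v], deg[v]) else (v, u)
        key = (f"{alpha[a]}{deg[a]}", f"{alpha[b]}{deg[b]}", beta[(u, v)])
        cnt[key] += 1
    return cnt

# A 3-vertex fragment C-C=O with hydrogen-suppressed degrees 2, 3, 1.
alpha = {0: "C", 1: "C", 2: "O"}
deg = {0: 2, 1: 3, 2: 1}
beta = {(0, 1): 1, (1, 2): 2}
print(edge_configurations([(0, 1), (1, 2)], alpha, deg, beta))
```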

Topological Specification

A topological specification is described as a set of the following rules proposed by Shi et al. [26] and modified by Tanaka et al. [25]:

(i) a seed graph $G_{\mathrm{C}}$ as an abstract form of a target chemical graph $\mathbb{C}$;

(ii) a set $\mathcal{F}$ of chemical rooted trees as candidates for a tree $\mathbb{C}[u]$ rooted at each exterior-vertex $u$ in $\mathbb{C}$; and

(iii) lower and upper bounds on the numbers of components in a target chemical graph, such as chemical elements, double/triple bonds, and interior-vertices in $\mathbb{C}$.

Fig. 4a,b illustrate examples of a seed graph $G_{\mathrm{C}}$ and a set $\mathcal{F}$ of chemical rooted trees, respectively. Given a seed graph $G_{\mathrm{C}}$, the interior of a target chemical graph $\mathbb{C}$ is constructed from $G_{\mathrm{C}}$ by replacing some edges $a=uv$ with paths $P_{a}$ between the end-vertices $u$ and $v$ and by attaching new paths $Q_{v}$ to some vertices $v$.

Fig. 4.

(a) An illustration of a seed graph $G_{\mathrm{C}}$ with $\mathrm{r}(G_{\mathrm{C}})=5$, where the vertices in $V_{\mathrm{C}}$ are depicted with gray circles, the edges in $E_{(\geq 2)}$ are depicted with dotted lines, the edges in $E_{(\geq 1)}$ are depicted with dashed lines, the edges in $E_{(0/1)}$ are depicted with gray bold lines and the edges in $E_{(=1)}$ are depicted with black solid lines; (b) A set $\mathcal{F}=\{\psi_{1},\psi_{2},\ldots,\psi_{30}\}\subseteq\mathcal{F}(D_{\pi})$ of 30 chemical rooted trees $\psi_{i}$, $i\in[1,30]$, where the root of each tree is depicted with a gray circle and the hydrogens attached to non-root vertices are omitted in the figure.

For example, a chemical graph $\langle\mathbb{C}\rangle$ in Fig. 2 is constructed from the seed graph $G_{\mathrm{C}}$ in Fig. 4a as follows.

- First replace five edges $a_{1}=u_{1}u_{2},a_{2}=u_{1}u_{3},a_{3}=u_{4}u_{7},a_{4}=u_{10}u_{11}$ and $a_{5}=u_{11}u_{12}$ in $G_{\mathrm{C}}$ with new paths $P_{a_{1}}=(u_{1},u_{13},u_{2})$, $P_{a_{2}}=(u_{1},u_{14},u_{3})$, $P_{a_{3}}=(u_{4},u_{15},u_{16},u_{7})$, $P_{a_{4}}=(u_{10},u_{17},u_{18},u_{19},u_{11})$ and $P_{a_{5}}=(u_{11},u_{20},u_{21},u_{22},u_{12})$, respectively to obtain a subgraph $G_{1}$ of $\langle\mathbb{C}\rangle$.

- Next attach to this graph $G_{1}$ three new paths $Q_{u_{5}}=(u_{5},u_{24})$, $Q_{u_{18}}=(u_{18},u_{25},u_{26},u_{27})$ and $Q_{u_{22}}=(u_{22},u_{28})$ to obtain the interior of $\langle\mathbb{C}\rangle$ in Fig. 2.

- Finally attach to the interior 28 trees selected from the set $\mathcal{F}$ and assign chemical elements and bond-multiplicities in the interior to obtain a chemical graph $\mathbb{C}$ in Fig. 2. In Fig. 3, $\psi_{1}\in\mathcal{F}$ is selected for $\mathbb{C}\left[u_{i}\right]$, $i\in\{6,7,11\}$. Similarly $\psi_{2}$ for $\mathbb{C}[u_{9}]$, $\psi_{4}$ for $\mathbb{C}[u_{1}]$, $\psi_{6}$ for $\mathbb{C}\left[u_{i}\right]$, $i\in\{3,4,5,10,19,22,25,26\}$, $\psi_{8}$ for $\mathbb{C}[u_{8}]$, $\psi_{11}$ for $\mathbb{C}\left[u_{i}\right]$, $i\in\{2,13,16,17,20\}$, $\psi_{15}$ for $\mathbb{C}[u_{12}]$, $\psi_{19}$ for $\mathbb{C}[u_{15}]$, $\psi_{23}$ for $\mathbb{C}[u_{21}]$, $\psi_{24}$ for $\mathbb{C}[u_{24}]$, $\psi_{25}$ for $\mathbb{C}[u_{27}]$, $\psi_{26}$ for $\mathbb{C}[u_{23}]$, $\psi_{27}$ for $\mathbb{C}[u_{14}]$ and $\psi_{30}$ for $\mathbb{C}[u_{28}]$.

Our definition of a topological specification is analogous to the one by Tanaka et al. [25] except for a necessary modification due to the introduction of multiple valences of chemical elements, cations and anions (see Supplementary Material for a full description of topological specification).

4. Linear Regression

For an integer $p\geq 1$ and a vector $x\in\mathbb{R}^{p}$, the $j$-th entry of $x$ is denoted by $x(j),j\in[1,p]$.

Let $D$ be a data set of chemical graphs $\mathbb{C}$ with an observed value $a(\mathbb{C})\in\mathbb{R}$, where we denote by $a_{i}=a(\mathbb{C}_{i})$ for an indexed graph $\mathbb{C}_{i}$.

Let $f$ be a feature function that maps a chemical graph $\mathbb{C}$ to a vector $f(\mathbb{C})\in\mathbb{R}^{K}$ where we denote by $x_{i}=f(\mathbb{C}_{i})$ for an indexed graph $\mathbb{C}_{i}$. For a prediction function $\eta:\mathbb{R}^{K}\to\mathbb{R}$, define an error function

(4)$\operatorname{Err}(\eta;D)\triangleq\sum_{\mathbb{C}_{i}\in D}\left(a_{i}-\eta\left(f\left(\mathbb{C}_{i}\right)\right)\right)^{2}=\sum_{\mathbb{C}_{i}\in D}\left(a_{i}-\eta\left(x_{i}\right)\right)^{2},$

and define the coefficient of determination $R^{2}(\eta,D)$ to be

(5)$\mathrm{R}^{2}(\eta,D)\triangleq 1-\frac{\operatorname{Err}(\eta;D)}{\sum_{\mathbb{C}_{i}\in D}\left(a_{i}-\widetilde{a}\right)^{2}}\text{ for }\widetilde{a}=\frac{1}{|D|}\sum_{\mathbb{C}\in D}a(\mathbb{C}).$
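Eqs. (4) and (5) translate directly into code; a minimal NumPy sketch (my own illustration, with made-up observed and predicted values):

```python
import numpy as np

def err(a, pred):
    """Err(eta; D): sum of squared residuals, Eq. (4)."""
    a, pred = np.asarray(a, float), np.asarray(pred, float)
    return float(np.sum((a - pred) ** 2))

def r_squared(a, pred):
    """R^2(eta, D) = 1 - Err / total sum of squares, Eq. (5)."""
    a = np.asarray(a, float)
    return 1.0 - err(a, pred) / float(np.sum((a - np.mean(a)) ** 2))

# Observed values a_i and predicted values eta(x_i) for four compounds.
a = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(a, pred), 3))  # 0.98
```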

For a feature space $\mathbb{R}^{K}$, a hyperplane is denoted by a pair $(w,b)$ of a vector $w\in\mathbb{R}^{K}$ and a real $b\in\mathbb{R}$. Given a hyperplane $(w,b)\in\mathbb{R}^{K}\times\mathbb{R}$, a prediction function $\eta_{w,b}:\mathbb{R}^{K}\to\mathbb{R}$ is defined by setting

(6)$\eta_{w,b}(x)\triangleq w\cdot x+b=\sum_{j\in[1,K]}w(j)x(j)+b.$

We wish to find a hyperplane $(w,b)$ that minimizes the error function $\mathrm{Err}(\eta_{w,b};D)$. In many cases, a feature vector $f$ contains descriptors that do not play an essential role in constructing a good prediction function. When we solve the minimization problem, the entries $w(j)$ for some descriptors $j\in[1,K]$ in the resulting hyperplane $(w,b)$ become zero, which means that these descriptors were not necessarily important for defining a prediction function $\eta_{w,b}$. Solving the minimization with an additional penalty term added to the error function often results in more entries with $w(j)=0$, reducing the set of descriptors necessary for defining a prediction function $\eta_{w,b}$. Well-known error functions with such penalty terms are the Ridge function $\frac{1}{2|D|}\operatorname{Err}\left(\eta_{w,b};D\right)+\lambda\left[\sum_{j\in[1,K]}w(j)^{2}+b^{2}\right]$ [28] and the Lasso function $\frac{1}{2|D|}\operatorname{Err}\left(\eta_{w,b};D\right)+\lambda\left[\sum_{j\in[1,K]}|w(j)|+|b|\right]$ [29], where $\lambda\in\mathbb{R}$ is a given real number.
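To make the sparsity effect of the Lasso penalty concrete, the following sketch (my own illustration, not the authors' code) minimizes a Lasso objective by coordinate descent with soft-thresholding; note that, unlike the Lasso function above, it leaves the intercept $b$ unpenalized, a common simplification. In practice a library implementation such as scikit-learn's `Lasso` would be used instead.

```python
import numpy as np

def lasso_cd(X, y, lam, iters=200):
    """Coordinate-descent Lasso: minimize
    (1/(2n)) * ||y - X w - b||^2 + lam * sum_j |w(j)|
    by soft-thresholding one coordinate at a time
    (intercept b is refit but not penalized)."""
    n, K = X.shape
    w, b = np.zeros(K), 0.0
    for _ in range(iters):
        b = float(np.mean(y - X @ w))          # refit intercept
        r = y - X @ w - b                      # current residual
        for j in range(K):
            r += X[:, j] * w[j]                # remove coordinate j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
            r -= X[:, j] * w[j]                # put updated coordinate back
    return w, b

# Synthetic data: only the first two of K = 10 descriptors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ np.array([3.0, -2.0] + [0.0] * 8) + 1.0 + 0.01 * rng.normal(size=100)
w, b = lasso_cd(X, y, lam=0.1)
# The penalty drives the irrelevant entries w(j) to (exactly) zero.
print(np.sum(np.abs(w) > 1e-8))
```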

Given a prediction function $\eta_{w,b}$, we can simulate the process of computing the output $\eta_{w,b}(x)$ for an input $x\in\mathbb{R}^{K}$ as an MILP $\mathcal{M}(x,y;\mathcal{C}_{1})$ in the framework. By solving such an MILP for a specified target value $y^{*}$, we can find a vector $x^{*}\in\mathbb{R}^{K}$ such that $\eta_{w,b}(x^{*})=y^{*}$. Instead of specifying a single target value $y^{*}$, we use lower and upper bounds $\underline{y}^{*},\overline{y}^{*}\in\mathbb{R}$ on the value $a(\mathbb{C})$ of a chemical graph $\mathbb{C}$ to be inferred. We can control the range between $\underline{y}^{*}$ and $\overline{y}^{*}$ when searching for a chemical graph $\mathbb{C}$ by setting $\underline{y}^{*}$ and $\overline{y}^{*}$ to be close or distant values. The desired MILP is formulated as follows.

$\mathcal{M}(x,y;\mathcal{C}_{1})$: An MILP formulation for the inverse problem to prediction function.

constants:

- A hyperplane $(w,b)$ with $w\in\mathbb{R}^{K}$ and $b\in\mathbb{R}$;

- Real values $\underline{y}^{*},\overline{y}^{*}\in\mathbb{R}$ such that $\underline{y}^{*}<\overline{y}^{*}$;

- A set $I_{\mathbb{Z}}$ of indices $j\in[1,K]$ such that the $j$-th descriptor $\operatorname{dcp}_{j}(\mathbb{C})$ is always an integer;

- A set $I_{+}$ of indices $j\in[1,K]$ such that the $j$-th descriptor $\operatorname{dcp}_{j}(\mathbb{C})$ is always non-negative;

- $\ell(j),u(j)\in\mathbb{R}$, $j\in[1,K]$: lower and upper bounds on the $j$-th descriptor;

variables:

- Non-negative integer variable $x(j)\in\mathbb{Z}_{+}$, $j\in I_{\mathbb{Z}}\cap I_{+}$;

- Integer variable $x(j)\in\mathbb{Z},j\in I_{\mathbb{Z}}\setminus I_{+}$;

- Non-negative real variable $x(j)\in\mathbb{R}_{+},j\in I_{+}\setminus I_{\mathbb{Z}}$;

- Real variable $x(j)\in\mathbb{R},j\in[1,K]\setminus(I_{\mathbb{Z}}\cup I_{+})$;

constraints:

(7)$\displaystyle\ell(j)\leq x(j)\leq u(j),\quad j\in[1,K];$
$\displaystyle\quad\underline{y}^{*}\leq\sum_{j\in[1,K]}w(j)x(j)+b\leq\overline{y}^{*}.$

objective function: none.

The number of variables and constraints in the above MILP formulation is $O(K)$. It is not difficult to see that the problem formulated by the above MILP is NP-hard. The entire MILP for Stage 4 consists of the two MILPs $\mathcal{M}(x,y;\mathcal{C}_{1})$ and $\mathcal{M}(g,x;\mathcal{C}_{2})$ with no objective function. The latter represents the computation process of our feature function $f$ and a given topological specification. See Supplementary Material for the details of MILP $\mathcal{M}(g,x;\mathcal{C}_{2})$.
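To make the structure of $\mathcal{M}(x,y;\mathcal{C}_{1})$ concrete, here is a minimal feasibility sketch with a hypothetical 4-descriptor hyperplane; scipy's `milp` stands in for the CPLEX solver used in our experiments:

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Hypothetical hyperplane (w, b) with K = 4 descriptors.
w = np.array([1.5, -0.5, 2.0, 0.25])
b = 3.0
y_lo, y_hi = 10.0, 12.0                 # target range [y*_lower, y*_upper]

ell = np.zeros(4)                       # lower bounds l(j)
u = np.array([20.0, 20.0, 5.0, 8.0])    # upper bounds u(j)
integrality = np.array([1, 1, 0, 0])    # x(1), x(2) integer; x(3), x(4) real

# Feasibility problem: zero objective, one linear constraint
# y_lo <= w . x + b <= y_hi, plus the variable bounds l(j) <= x(j) <= u(j).
res = milp(
    c=np.zeros(4),
    integrality=integrality,
    bounds=Bounds(ell, u),
    constraints=LinearConstraint(w[np.newaxis, :], y_lo - b, y_hi - b),
)
x_star = res.x
```

Any returned `x_star` is a feasible feature vector whose predicted value lies in the target range; the actual Stage 4 MILP additionally couples $x$ to a chemical graph via $\mathcal{M}(g,x;\mathcal{C}_{2})$.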

5. Results

We implemented our method of Stages 1 to 5 for inferring chemical graphs under a given topological specification and conducted experiments to evaluate the computational efficiency. We executed the experiments on a PC with Processor: Core i7-9700 (3.0 GHz; 4.7 GHz at the maximum) and Memory: 16 GB RAM DDR4.

Results on Phase 1. We conducted experiments of linear regression for 37 chemical properties, among which we report the following 18 properties for which the test coefficient of determination ${\rm R}^{2}$ attains at least 0.8: octanol/water partition coefficient (Kow), heat of combustion (Hc), vapor density (Vd), optical rotation (OptR), electron density on the most positive atom (EDPA), melting point (Mp), heat of atomization (Ha), heat of formation (Hf), internal energy at 0K (U0), energy of lowest unoccupied molecular orbital (Lumo), isotropic polarizability (Alpha), heat capacity at 298.15K (Cv), solubility (Sl), surface tension (SfT), viscosity (Vis), isobaric heat capacities in liquid phase (IhcLiq), isobaric heat capacities in solid phase (IhcSol) and lipophilicity (Lp).

We used data sets provided by HSDB from PubChem [30] for Kow, Hc, Vd and OptR, Jalali-Heravi and Fatemi [31] for EDPA, Roy and Saha [32] for Mp, Ha and Hf, MoleculeNet [33] for U0, Lumo, Alpha, Cv and Sl, Goussard et al. [34] for SfT and Vis, Naef [35] for IhcLiq and IhcSol, and Figshare [36] for Lp.

Properties U0, Lumo, Alpha and Cv share a common original data set $D^{*}$ with more than 130,000 compounds, and we used a set $D_{\pi}$ of 1,000 graphs randomly selected from $D^{*}$ as a common data set of these four properties $\pi$ in this experiment.

Stages 1, 2 and 3 in Phase 1 are implemented as follows.

Stage 1. We set a graph class $\mathcal{G}$ to be the set of all chemical graphs with any graph structure, and set a branch-parameter ${\rho}$ to be 2.

For each property, we first select a set $\Lambda$ of chemical elements and then collect a data set $D_{\pi}$ of chemical graphs over the set $\Lambda$. During construction of the data set $D_{\pi}$, chemical compounds that violate any of the following conditions are eliminated: the graph is connected, the number of non-hydrogen neighbors of each atom is at most four, and the number of carbon atoms is at least four.

Table 1 shows the size and range of data sets that we prepared for each chemical property in Stage 1, where we denote the following:

- $\Lambda$: the set of elements used in the data set $D_{\pi}$; $\Lambda$ is one of the following 11 sets: $\Lambda_{1}=\{\mathrm{H},\mathrm{C},\mathrm{O}\}$; $\Lambda_{2}=\{\mathrm{H},\mathrm{C},\mathrm{O},\mathrm{N}\}$; $\Lambda_{3}=\{\mathrm{H},\mathrm{C},\mathrm{O},\mathrm{S}_{(2)}\}$; $\Lambda_{4}=\{\mathrm{H},\mathrm{C},\mathrm{O},\mathrm{Si}\}$; $\Lambda_{5}=\{\mathrm{H},\mathrm{C},\mathrm{O},\mathrm{N},\mathrm{Cl},\mathrm{P}_{(3)},\mathrm{P}_{(5)}\}$; $\Lambda_{6}=\{\mathrm{H},\mathrm{C},\mathrm{O},\mathrm{N},\mathrm{S}_{(2)},\mathrm{F}\}$; $\Lambda_{7}=\{\mathrm{H},\mathrm{C},\mathrm{O},\mathrm{N},\mathrm{S}_{(2)},\mathrm{S}_{(6)},\mathrm{Cl}\}$; $\Lambda_{8}=\{\mathrm{H},\mathrm{C}_{(2)},\mathrm{C}_{(3)},\mathrm{C}_{(4)},\mathrm{O},\mathrm{N}_{(2)},\mathrm{N}_{(3)}\}$; $\Lambda_{9}=\{\mathrm{H},\mathrm{C},\mathrm{O},\mathrm{N},\mathrm{S}_{(2)},\mathrm{S}_{(4)},\mathrm{S}_{(6)},\mathrm{Cl}\}$; $\Lambda_{10}=\{\mathrm{H},\mathrm{C}_{(2)},\mathrm{C}_{(3)},\mathrm{C}_{(4)},\mathrm{C}_{(5)},\mathrm{O},\mathrm{N}_{(1)},\mathrm{N}_{(2)},\mathrm{N}_{(3)},\mathrm{F}\}$; and $\Lambda_{11}=\{\mathrm{H},\mathrm{C}_{(2)},\mathrm{C}_{(3)},\mathrm{C}_{(4)},\mathrm{O},\mathrm{N}_{(2)},\mathrm{N}_{(3)},\mathrm{S}_{(2)},\mathrm{S}_{(4)},\mathrm{S}_{(6)},\mathrm{Cl}\}$, where ${\tt e}_{(i)}$ for a chemical element ${\tt e}$ and an integer $i\geq 1$ denotes the chemical element ${\tt e}$ with valence $i$.

- $|D_{\pi}|$: the size of data set $D_{\pi}$ over $\Lambda$ for the property $\pi$.

- $\underline{n},~{}\overline{n}$: the minimum and maximum values of the number $n(\mathbb{C})$ of non-hydrogen atoms in compounds $\mathbb{C}$ in $D_{\pi}$.

- $\underline{a},~{}\overline{a}$: the minimum and maximum values of $a(\mathbb{C})$ for $\pi$ over compounds $\mathbb{C}$ in $D_{\pi}$.

- $|\Gamma|$: the number of different edge-configurations of interior-edges over the compounds in $D_{\pi}$.

- $|\mathcal{F}|$: the number of non-isomorphic chemical rooted trees in the set of all 2-fringe-trees in the compounds in $D_{\pi}$.

- $K$: the number of descriptors in a feature vector $f(\mathbb{C})$.

Table 1. Results in Phase 1.

| $\pi$ | $\Lambda$ | $\vert D_{\pi}\vert$ | $\underline{n},~{}\overline{n}$ | $\underline{a},~{}\overline{a}$ | $\vert\Gamma\vert$ | $\vert\mathcal{F}\vert$ | $K$ | $\lambda_{\pi}$ | $K^{\prime}$ | test $\mathrm{R}^{2}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Kow | $\Lambda_{2}$ | 684 | 4, 58 | –7.5, 15.6 | 25 | 166 | 223 | 6.4E-5 | 80.3 | 0.953 |
| Kow | $\Lambda_{9}$ | 899 | 4, 69 | –7.5, 15.6 | 37 | 219 | 303 | 5.5E-5 | 112.1 | 0.927 |
| Hc | $\Lambda_{2}$ | 255 | 4, 63 | 49.6, 35099.6 | 17 | 106 | 154 | 1.9E-4 | 19.2 | 0.946 |
| Hc | $\Lambda_{7}$ | 282 | 4, 63 | 49.6, 35099.6 | 21 | 118 | 177 | 1.9E-4 | 20.5 | 0.951 |
| Vd | $\Lambda_{2}$ | 474 | 4, 30 | 0.7, 20.6 | 21 | 160 | 214 | 1.0E-3 | 3.6 | 0.927 |
| Vd | $\Lambda_{5}$ | 551 | 4, 30 | 0.7, 20.6 | 24 | 191 | 256 | 5.5E-4 | 8.0 | 0.942 |
| OptR | $\Lambda_{2}$ | 147 | 5, 44 | –117.0, 165.0 | 21 | 55 | 107 | 4.6E-4 | 39.2 | 0.823 |
| OptR | $\Lambda_{6}$ | 157 | 5, 69 | –117.0, 165.0 | 25 | 62 | 123 | 7.3E-4 | 41.7 | 0.825 |
| EDPA | $\Lambda_{1}$ | 52 | 11, 16 | 0.80, 3.76 | 9 | 33 | 64 | 1.0E-4 | 10.9 | 0.999 |
| Mp | $\Lambda_{2}$ | 467 | 4, 122 | –185.33, 300.0 | 23 | 142 | 197 | 3.7E-5 | 82.5 | 0.817 |
| Ha | $\Lambda_{3}$ | 115 | 4, 11 | 1100.6, 3009.6 | 8 | 83 | 115 | 3.7E-5 | 39.0 | 0.997 |
| Hf | $\Lambda_{1}$ | 82 | 4, 16 | 30.2, 94.8 | 5 | 50 | 74 | 1.0E-4 | 34.0 | 0.987 |
| U0 | $\Lambda_{10}$ | 977 | 4, 9 | –570.6, –272.8 | 59 | 190 | 297 | 1.0E-7 | 246.7 | 0.999 |
| Lumo | $\Lambda_{10}$ | 977 | 4, 9 | –0.11, 0.10 | 59 | 190 | 297 | 6.4E-5 | 133.9 | 0.841 |
| Alpha | $\Lambda_{10}$ | 977 | 4, 9 | 50.9, 99.6 | 59 | 190 | 297 | 1.0E-5 | 125.5 | 0.961 |
| Cv | $\Lambda_{10}$ | 977 | 4, 9 | 19.2, 44.0 | 59 | 190 | 297 | 1.0E-5 | 165.3 | 0.961 |
| Sl | $\Lambda_{9}$ | 915 | 4, 55 | –11.6, 1.11 | 42 | 207 | 300 | 7.3E-5 | 130.6 | 0.808 |
| SfT | $\Lambda_{4}$ | 247 | 5, 33 | 12.3, 45.1 | 11 | 91 | 128 | 6.4E-4 | 20.9 | 0.804 |
| Vis | $\Lambda_{4}$ | 282 | 5, 36 | –0.64, 1.63 | 12 | 88 | 126 | 8.2E-4 | 16.3 | 0.893 |
| IhcLiq | $\Lambda_{2}$ | 770 | 4, 78 | 106.3, 1956.1 | 23 | 200 | 256 | 1.9E-5 | 82.2 | 0.987 |
| IhcLiq | $\Lambda_{7}$ | 865 | 4, 78 | 106.3, 1956.1 | 29 | 246 | 316 | 8.2E-6 | 139.1 | 0.986 |
| IhcSol | $\Lambda_{8}$ | 581 | 5, 70 | 67.4, 1220.9 | 33 | 124 | 192 | 2.8E-5 | 75.9 | 0.985 |
| IhcSol | $\Lambda_{11}$ | 668 | 5, 70 | 67.4, 1220.9 | 40 | 140 | 228 | 2.8E-5 | 86.7 | 0.982 |
| Lp | $\Lambda_{2}$ | 615 | 6, 60 | –3.62, 6.84 | 32 | 116 | 186 | 1.0E-4 | 98.5 | 0.856 |
| Lp | $\Lambda_{9}$ | 936 | 6, 74 | –3.62, 6.84 | 44 | 136 | 231 | 6.4E-5 | 130.4 | 0.840 |

Stage 2. We use the feature function newly defined in Section 3 for our chemical model without suppressing hydrogens. We standardize the range of each descriptor and the range $\{t\in\mathbb{R}\mid\underline{a}\leq t\leq\overline{a}\}$ of property values $a(\mathbb{C}),\mathbb{C}\in D_{\pi}$.
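The standardization in Stage 2 is the usual affine rescaling of each descriptor column, sketched here on a toy matrix:

```python
import numpy as np

# Toy descriptor matrix: rows are compounds, columns are descriptors.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma   # each column now has mean 0 and standard deviation 1
```

Standardizing puts descriptors with very different scales (e.g., counts vs. energies) on an equal footing before the Lasso penalty is applied.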

Stage 3. For each chemical property $\pi$, we select a penalty value $\lambda_{\pi}$ in the Lasso function from 36 candidate values between 0 and 100, chosen by conducting linear regression as a preliminary experiment.

We conducted an experiment in Stage 3 to evaluate the performance of the prediction function based on cross-validation. For a property $\pi$, an execution of a cross-validation consists of five trials of constructing a prediction function as follows. We first partition the data set $D_{\pi}$ randomly into five subsets $D^{(k)}$, $k\in[1,5]$. For each $k\in[1,5]$, the $k$-th trial constructs a prediction function $\eta^{(k)}$ by conducting a linear regression with the penalty term $\lambda_{\pi}$, using the set $D_{\pi}\setminus D^{(k)}$ as a training data set. We used scikit-learn version 0.23.2 with Python 3.8.5 for executing linear regression with the Lasso function. For each property, we executed ten cross-validations, and we report the median of test $\mathrm{R}^{2}(\eta^{(k)},D^{(k)}),k\in[1,5]$ over all ten cross-validations. Recall that a subset of descriptors is selected in linear regression with the Lasso function; let $K^{\prime}$ denote the average number of selected descriptors over all 50 trials. The running time per trial in a cross-validation was at most one second.
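The evaluation protocol above (ten 5-fold cross-validations, median test $\mathrm{R}^{2}$ over the 50 trials) can be sketched as follows; the data here are synthetic and the penalty value is a stand-in for $\lambda_{\pi}$, not the actual data sets of Table 1:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in for a data set D_pi of feature vectors and property values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lam = 0.01                     # stand-in for the selected penalty lambda_pi
scores = []
for seed in range(10):         # ten cross-validations
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train, test in kf.split(X):   # five trials per cross-validation
        eta = Lasso(alpha=lam).fit(X[train], y[train])
        scores.append(r2_score(y[test], eta.predict(X[test])))

median_r2 = float(np.median(scores))  # the statistic reported in Table 1
```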

Table 1 shows the results on Stages 2 and 3, where we denote the following:

- $\lambda_{\pi}$: the penalty value in the Lasso function selected for a property $\pi$, where $a\mathrm{E}{b}$ means $a\times 10^{b}$.

- $K^{\prime}$: the average of the number of descriptors selected in the linear regression over all 50 trials in ten cross-validations.

- test $\mathrm{R}^{2}$: the median of test $\mathrm{R}^{2}$ over all 50 trials in ten cross-validations.

Recall that the adjacency-configuration for leaf-edges was introduced as a new descriptor in this paper. Without including this new descriptor, the test $\mathrm{R}^{2}$ for property Vis was 0.790, that for Lumo was 0.799 and that for Mp was 0.796, while the test $\mathrm{R}^{2}$ for each of the other properties in Table 1 was almost the same.

From Table 1, we observe that a relatively large number of properties admit a good prediction function based on linear regression. The number $K^{\prime}$ of descriptors used in linear regression is considerably small for some properties. For example, for property Vd, the four descriptors most frequently selected in the case of $\Lambda=\{\mathrm{H},\mathrm{O},\mathrm{C},\mathrm{N}\}$ are the number of non-hydrogen atoms; the number of interior-vertices $v$ with $\text{deg}_{\mathrm{C}^{\text{int}}}(v)=1$; the number of fringe-trees r-isomorphic to the chemical rooted tree $\psi_{1}$ in Fig. 5; and the number of leaf-edges with adjacency-configuration $(\mathrm{O},\mathrm{C},2)$. The eight descriptors most frequently selected in the case of $\Lambda=\{\mathrm{H},\mathrm{O},\mathrm{C},\mathrm{N},\mathrm{Cl},\mathrm{P}_{(3)},\mathrm{P}_{(5)}\}$ are the number of non-hydrogen atoms; the number of interior-vertices $v$ with $\text{deg}_{\mathrm{C}^{\text{int}}}(v)=1$; the number of exterior-vertices $v$ with $\alpha(v)=\mathrm{Cl}$; the number of interior-edges with edge-configuration $\gamma_{i},i=1,2$, where $\gamma_{1}=(\mathrm{C}2,\mathrm{C}2,2)$ and $\gamma_{2}=(\mathrm{C}3,\mathrm{C}4,1)$; and the number of fringe-trees r-isomorphic to the chemical rooted tree $\psi_{i},i=1,2,3$ in Fig. 5.

Fig. 5.

An illustration of the chemical rooted trees $\psi_{1}$, $\psi_{2}$ and $\psi_{3}$ that are selected in Lasso linear regression for constructing a prediction function for property Vd, where the root is depicted with a gray circle.

For the 18 properties listed in Table 1, we also used ANNs to construct prediction functions. For this purpose, we used our newly proposed feature vector and the experimental setup explained in Tanaka et al. [25]. From these computational experiments, we observe that for the properties Hc, Vd, Ha, Hf, U0, Alpha and Cv, the test $\mathrm{R}^{2}$ scores of the prediction functions obtained by Lasso linear regression are at least 0.05 higher than those obtained by ANN. For the properties OptR, Sl and SfT, the test $\mathrm{R}^{2}$ scores of the prediction functions obtained by ANN are at least 0.05 higher than those obtained by Lasso linear regression. For the other properties, the test $\mathrm{R}^{2}$ scores obtained by Lasso linear regression and ANN are comparable.

Results on Phase 2. We used a set of seven instances $I_{\mathrm{a}}$, $I_{\mathrm{b}}^{i},i\in[1,4]$, $I_{\mathrm{c}}$ and $I_{\mathrm{d}}$ based on seed graphs prepared by Shi et al. [26] to execute Stages 4 and 5 in Phase 2. We here present their seed graphs $G_{\mathrm{C}}$ (see Supplementary Material for the details of instances $I_{\mathrm{a}}$, $I_{\mathrm{b}}^{i},i\in[1,4]$, $I_{\mathrm{c}}$ and $I_{\mathrm{d}}$). The seed graph $G_{\mathrm{C}}$ of instance $I_{\mathrm{a}}$ is illustrated in Fig. 4a. The seed graph $G_{\mathrm{C}}^{1}$ (resp., $G_{\mathrm{C}}^{i},i=2,3,4$) of instances $I_{\mathrm{b}}^{1}$ and $I_{\mathrm{d}}$ (resp., $I_{\mathrm{b}}^{i},i=2,3,4$) is illustrated in Fig. 6.

Fig. 6.

(i) Seed graph $G_{\mathrm{C}}^{1}$ for $I_{\mathrm{b}}^{1}$ and $I_{\mathrm{d}}$; (ii) Seed graph $G_{\mathrm{C}}^{2}$ for $I_{\mathrm{b}}^{2}$; (iii) Seed graph $G_{\mathrm{C}}^{3}$ for $I_{\mathrm{b}}^{3}$; (iv) Seed graph $G_{\mathrm{C}}^{4}$ for $I_{\mathrm{b}}^{4}$.

Instance $I_{\mathrm{c}}$ has been introduced by Shi et al. [26] in order to infer a chemical graph $\mathbb{C}^{\dagger}$ such that the core of $\mathbb{C}^{\dagger}$ is the same as the core of chemical graph $\mathbb{C}_{A}$: CID 24822711 in Fig. 7a and the frequency of each edge-configuration in the non-core of $\mathbb{C}^{\dagger}$ is the same as that of chemical graph $\mathbb{C}_{B}$: CID 59170444 illustrated in Fig. 7b. This means that the seed graph $G_{\mathrm{C}}$ of $I_{\mathrm{c}}$ is the core of $\mathbb{C}_{A}$ which is indicated by a shaded area in Fig. 7a.

Instance $I_{\mathrm{d}}$ has been introduced by Shi et al. [26] in order to infer a monocyclic chemical graph $\mathbb{C}^{\dagger}$ such that the frequency vector of edge-configurations in $\mathbb{C}^{\dagger}$ is a vector obtained by merging those of two chemical graphs $\mathbb{C}_{A}$: CID 10076784 and $\mathbb{C}_{B}$: CID 44340250 illustrated in Fig. 7c,d, respectively.

Fig. 7.

An illustration of chemical compounds for instances $I_{\rm c}$ and $I_{\rm d}$: (a) $\mathbb{C}_{A}$: CID 24822711; (b) $\mathbb{C}_{B}$: CID 59170444; (c) $\mathbb{C}_{A}$: CID 10076784; (d) $\mathbb{C}_{B}$: CID 44340250, where hydrogens are omitted.

Stage 4. We executed Stage 4 for five properties $\pi\in\{$Hc, Vd, OptR, IhcLiq, Vis$\}$.

For the MILP formulation $\mathcal{M}(x,y;\mathcal{C}_{1})$ in Section 4, we use the prediction function $\eta_{w,b}$ that attained the median test $\mathrm{R}^{2}$ in Table 1. We used CPLEX version 12.10 to solve an MILP in Stage 4. Tables 2,3,4,5,6 show the computational results of the experiment in Stage 4 for the five properties, where we denote the following:

- $\underline{y}^{*},~{}\overline{y}^{*}$: lower and upper bounds $\underline{y}^{*}$, $\overline{y}^{*}\in\mathbb{R}$ on the value $a(\mathbb{C})$ of a chemical graph $\mathbb{C}$ to be inferred;

- $\#$v (resp., $\#$c): the number of variables (resp., constraints) in the MILP in Stage 4;

- I-time: the time (sec.) to solve the MILP in Stage 4;

- $n$: the number $n\left(\mathbb{C}^{\dagger}\right)$ of non-hydrogen atoms in the chemical graph $\mathbb{C}^{\dagger}$ inferred in Stage 4;

- $\mathrm{n}^{\mathrm{int}}$: the number $\mathrm{n}^{\mathrm{int}}\left(\mathbb{C}^{\dagger}\right)$ of interior-vertices in the chemical graph $\mathbb{C}^{\dagger}$ inferred in Stage 4; and

- $\eta(f(\mathbb{C}^{\dagger}))$: the predicted property value $\eta(f(\mathbb{C}^{\dagger}))$ of the chemical graph $\mathbb{C}^{\dagger}$ inferred in Stage 4.

Table 2. Results of Stages 4 and 5 for Hc using Lasso linear regression.

| inst. | $\underline{y}^{*},~{}\overline{y}^{*}$ | $\#$v | $\#$c | I-time | $n$ | $\mathrm{n}^{\mathrm{int}}$ | $\eta(f(\mathbb{C}^{\dagger}))$ | D-time | $\mathbb{C}$-LB | $\#\mathbb{C}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $I_{\mathrm{a}}$ | 5950, 6050 | 9902 | 9255 | 4.6 | 44 | 25 | 5977.9 | 0.068 | 1 | 1 |
| $I_{\mathrm{b}}^{1}$ | 5950, 6050 | 9404 | 6776 | 1.7 | 36 | 10 | 6007.1 | 0.048 | 6 | 6 |
| $I_{\mathrm{b}}^{2}$ | 5950, 6050 | 11729 | 9891 | 16.7 | 50 | 25 | 6043.7 | 38.7 | $2.4\times 10^{5}$ | 100 |
| $I_{\mathrm{b}}^{3}$ | 5950, 6050 | 11510 | 9894 | 16.3 | 47 | 25 | 6015.4 | 0.353 | 8724 | 100 |
| $I_{\mathrm{b}}^{4}$ | 5950, 6050 | 11291 | 9897 | 9.0 | 49 | 26 | 5971.6 | 0.304 | 84 | 84 |
| $I_{\mathrm{c}}$ | 13700, 13800 | 6915 | 7278 | 0.7 | 50 | 33 | 13703.3 | 0.016 | 1 | 1 |
| $I_{\mathrm{d}}$ | 13700, 13800 | 5535 | 6781 | 4.9 | 44 | 23 | 13704.7 | 0.564 | $4.3\times 10^{5}$ | 100 |
Table 3. Results of Stages 4 and 5 for Vd using Lasso linear regression.

| inst. | $\underline{y}^{*},~{}\overline{y}^{*}$ | $\#$v | $\#$c | I-time | $n$ | $\mathrm{n}^{\mathrm{int}}$ | $\eta(f(\mathbb{C}^{\dagger}))$ | D-time | $\mathbb{C}$-LB | $\#\mathbb{C}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $I_{\mathrm{a}}$ | 16, 17 | 9481 | 9358 | 1.6 | 38 | 23 | 16.83 | 0.070 | 1 | 1 |
| $I_{\mathrm{b}}^{1}$ | 16, 17 | 9928 | 6986 | 1.5 | 35 | 12 | 16.68 | 0.206 | 48 | 48 |
| $I_{\mathrm{b}}^{2}$ | 21, 22 | 12373 | 10101 | 10.0 | 48 | 25 | 21.62 | 0.104 | 20 | 20 |
| $I_{\mathrm{b}}^{3}$ | 21, 22 | 12159 | 10104 | 6.5 | 48 | 25 | 21.95 | 3.65 | $8.6\times 10^{5}$ | 100 |
| $I_{\mathrm{b}}^{4}$ | 21, 22 | 11945 | 10107 | 8.1 | 48 | 25 | 21.34 | 0.057 | 6 | 6 |
| $I_{\mathrm{c}}$ | 21, 22 | 7073 | 7438 | 0.7 | 50 | 34 | 21.89 | 0.016 | 1 | 1 |
| $I_{\mathrm{d}}$ | 17, 18 | 5693 | 6942 | 2.1 | 41 | 23 | 17.94 | 0.161 | 216 | 100 |
Table 4. Results of Stages 4 and 5 for OptR using Lasso linear regression.

| inst. | $\underline{y}^{*},~{}\overline{y}^{*}$ | $\#$v | $\#$c | I-time | $n$ | $\mathrm{n}^{\mathrm{int}}$ | $\eta(f(\mathbb{C}^{\dagger}))$ | D-time | $\mathbb{C}$-LB | $\#\mathbb{C}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $I_{\mathrm{a}}$ | 70, 71 | 8962 | 9064 | 3.5 | 40 | 23 | 70.1 | 0.061 | 1 | 1 |
| $I_{\mathrm{b}}^{1}$ | 70, 71 | 9432 | 6662 | 2.7 | 37 | 14 | 70.1 | 0.185 | 2622 | 100 |
| $I_{\mathrm{b}}^{2}$ | 70, 71 | 11818 | 9773 | 10.0 | 50 | 25 | 70.8 | 0.041 | 4 | 4 |
| $I_{\mathrm{b}}^{3}$ | 70, 71 | 11602 | 9776 | 10.2 | 50 | 25 | 70.2 | 0.241 | 60 | 60 |
| $I_{\mathrm{b}}^{4}$ | 70, 71 | 11386 | 9779 | 24.7 | 49 | 25 | 70.9 | 6.39 | $4.6\times 10^{5}$ | 100 |
| $I_{\mathrm{c}}$ | –112, –111 | 6807 | 7170 | 1.8 | 50 | 32 | –111.9 | 0.016 | 1 | 1 |
| $I_{\mathrm{d}}$ | 70, 71 | 5427 | 6673 | 6.1 | 42 | 23 | 70.2 | 0.127 | 78768 | 100 |
Table 5. Results of Stages 4 and 5 for IhcLiq using Lasso linear regression.

| inst. | $\underline{y}^{*},~{}\overline{y}^{*}$ | $\#$v | $\#$c | I-time | $n$ | $\mathrm{n}^{\mathrm{int}}$ | $\eta(f(\mathbb{C}^{\dagger}))$ | D-time | $\mathbb{C}$-LB | $\#\mathbb{C}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $I_{\mathrm{a}}$ | 1190, 1210 | 10180 | 9538 | 3.9 | 48 | 26 | 1208.5 | 0.071 | 2 | 2 |
| $I_{\mathrm{b}}^{1}$ | 1190, 1210 | 10784 | 7191 | 2.4 | 35 | 14 | 1206.7 | 0.082 | 12 | 12 |
| $I_{\mathrm{b}}^{2}$ | 1190, 1210 | 13482 | 10302 | 14.1 | 47 | 25 | 1206.7 | 0.11 | 12 | 12 |
| $I_{\mathrm{b}}^{3}$ | 1190, 1210 | 13275 | 10301 | 9.0 | 49 | 27 | 1209.9 | 0.090 | 24 | 24 |
| $I_{\mathrm{b}}^{4}$ | 1190, 1210 | 13128 | 10306 | 16.5 | 50 | 25 | 1208.4 | 0.424 | 2388 | 100 |
| $I_{\mathrm{c}}$ | 1190, 1210 | 7193 | 7560 | 0.8 | 50 | 33 | 1196.5 | 0.016 | 1 | 1 |
| $I_{\mathrm{d}}$ | 1190, 1210 | 5813 | 7063 | 2.2 | 44 | 23 | 1198.8 | 5.63 | $5.2\times 10^{5}$ | 100 |
Table 6. Results of Stages 4 and 5 for Vis using Lasso linear regression.

| inst. | $\underline{y}^{*},~{}\overline{y}^{*}$ | $\#$v | $\#$c | I-time | $n$ | $\mathrm{n}^{\mathrm{int}}$ | $\eta(f(\mathbb{C}^{\dagger}))$ | D-time | $\mathbb{C}$-LB | $\#\mathbb{C}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $I_{\mathrm{a}}$ | 1.25, 1.30 | 6847 | 8906 | 1.3 | 38 | 22 | 1.295 | 0.042 | 2 | 2 |
| $I_{\mathrm{b}}^{1}$ | 1.25, 1.30 | 7270 | 6397 | 2.5 | 36 | 15 | 1.272 | 0.155 | 140 | 100 |
| $I_{\mathrm{b}}^{2}$ | 1.85, 1.90 | 8984 | 9512 | 8.9 | 45 | 25 | 1.879 | 0.149 | 288 | 100 |
| $I_{\mathrm{b}}^{3}$ | 1.85, 1.90 | 8741 | 9515 | 16.2 | 45 | 26 | 1.880 | 0.137 | 4928 | 100 |
| $I_{\mathrm{b}}^{4}$ | 1.85, 1.90 | 8498 | 9518 | 8.1 | 45 | 25 | 1.851 | 0.13 | 660 | 100 |
| $I_{\mathrm{c}}$ | 2.75, 2.80 | 6813 | 7162 | 1.0 | 50 | 33 | 2.763 | 0.025 | 4 | 4 |
| $I_{\mathrm{d}}$ | 1.85, 1.90 | 5433 | 6665 | 2.7 | 41 | 23 | 1.881 | 0.138 | 4608 | 100 |

From Tables 2,3,4,5,6 we observe that, in general, an instance with a larger number of variables and constraints takes more running time than a smaller one. With our MILP formulation, every instance in this experiment was solved within a few seconds to around 30 seconds.

Fig. 8a–e illustrate the chemical graphs $\mathbb{C}^{\dagger}$ inferred from $I_{\mathrm{c}}$ with $(\underline{y}^{*},\overline{y}^{*})=(13700,13800)$ of Hc, $I_{\mathrm{b}}^{2}$ with $(\underline{y}^{*},\overline{y}^{*})=(21,22)$ of Vd, $I_{\mathrm{b}}^{4}$ with $(\underline{y}^{*},\overline{y}^{*})=(70,71)$ of OptR, $I_{\mathrm{d}}$ with $(\underline{y}^{*},\overline{y}^{*})=(1190,1210)$ of IhcLiq, and $I_{\mathrm{b}}^{3}$ with $(\underline{y}^{*},\overline{y}^{*})=(1.85,1.90)$ of Vis, respectively.

Fig. 8.

(a) $\mathbb{C}^{\dagger}$ with $\eta(f(\mathbb{C}^{\dagger}))=13703.3$ inferred from $I_{\mathrm{c}}$ with $(\underline{y}^{*},\overline{y}^{*})=(13700,13800)$ of Hc; (b) $\mathbb{C}^{\dagger}$ with $\eta(f(\mathbb{C}^{\dagger}))=21.62$ inferred from $I_{\mathrm{b}}^{2}$ with $(\underline{y}^{*},\overline{y}^{*})=(21,22)$ of Vd; (c) $\mathbb{C}^{\dagger}$ with $\eta(f(\mathbb{C}^{\dagger}))=70.9$ inferred from $I_{\mathrm{b}}^{4}$ with $(\underline{y}^{*},\overline{y}^{*})=(70,71)$ of OptR; (d) $\mathbb{C}^{\dagger}$ with $\eta(f(\mathbb{C}^{\dagger}))=1198.8$ inferred from $I_{\mathrm{d}}$ with $(\underline{y}^{*},\overline{y}^{*})=(1190,1210)$ of IhcLiq; (e) $\mathbb{C}^{\dagger}$ with $\eta(f(\mathbb{C}^{\dagger}))=1.880$ inferred from $I_{\mathrm{b}}^{3}$ with $(\underline{y}^{*},\overline{y}^{*})=(1.85,1.90)$ of Vis; (f) $\mathbb{C}^{\dagger}$ inferred from $I_{\mathrm{b}}^{4}$ with lower and upper bounds on the predicted property value $\eta_{\pi}(f(\mathbb{C}^{\dagger}))$ of property $\pi\in\{$Kow, Lp, Sl$\}$ in Table 9.

Similarly, we executed Stage 4 for the seven instances $I_{\mathrm{a}}$, $I_{\mathrm{b}}^{i},i\in[1,4]$, $I_{\mathrm{c}}$ and $I_{\mathrm{d}}$ for the five properties $\pi\in\{$Hc, Vd, OptR, IhcLiq, Vis$\}$ using the prediction functions obtained by ANN. Tables 7,8 list the running time to solve the MILP formulation for each of these instances. From these experiments, we observe that for many instances the MILP based on linear regression is solved significantly faster than the corresponding MILP based on ANN.

Table 7. Running time of Stage 4 for Hc, Vd and OptR using ANN.

| inst. | Hc: $\underline{y}^{*},~{}\overline{y}^{*}$ | Hc: I-time | Vd: $\underline{y}^{*},~{}\overline{y}^{*}$ | Vd: I-time | OptR: $\underline{y}^{*},~{}\overline{y}^{*}$ | OptR: I-time |
|---|---|---|---|---|---|---|
| $I_{\mathrm{a}}$ | 13350, 13450 | 24.7 | 18, 19 | 18.1 | 62, 63 | 35.6 |
| $I_{\mathrm{b}}^{1}$ | 9650, 9750 | 13.5 | 13, 14 | 9.4 | 109, 110 | 15.5 |
| $I_{\mathrm{b}}^{2}$ | 16750, 16850 | 70.4 | 15, 16 | 40.9 | 23, 24 | 192.6 |
| $I_{\mathrm{b}}^{3}$ | 12350, 12450 | 87.0 | 20, 21 | 46.3 | –2, –1 | 936.4 |
| $I_{\mathrm{b}}^{4}$ | 14250, 14350 | 70.9 | 22, 23 | 27.1 | 19, 20 | 63.9 |
| $I_{\mathrm{c}}$ | 10400, 10500 | 31.3 | 20, 21 | 20.5 | 86, 87 | 16.4 |
| $I_{\mathrm{d}}$ | 12500, 12600 | 44.3 | 18, 19 | 6.1 | 30, 31 | 31.8 |
Table 8. Running time of Stage 4 for IhcLiq and Vis using ANN.

| inst. | IhcLiq: $\underline{y}^{*},~{}\overline{y}^{*}$ | IhcLiq: I-time | Vis: $\underline{y}^{*},~{}\overline{y}^{*}$ | Vis: I-time |
|---|---|---|---|---|
| $I_{\mathrm{a}}$ | 980, 1000 | 56.6 | 1.85, 1.90 | 2.0 |
| $I_{\mathrm{b}}^{1}$ | 1000, 1020 | 40.4 | 1.95, 2.00 | 3.5 |
| $I_{\mathrm{b}}^{2}$ | 1130, 1150 | 71.6 | 1.85, 1.90 | 19.7 |
| $I_{\mathrm{b}}^{3}$ | 1240, 1260 | 45.0 | 2.35, 2.40 | 26.0 |
| $I_{\mathrm{b}}^{4}$ | 1240, 1260 | 105.7 | 2.50, 2.55 | 9.3 |
| $I_{\mathrm{c}}$ | 810, 830 | 9.7 | 3.90, 3.95 | 1.8 |
| $I_{\mathrm{d}}$ | 1100, 1120 | 25.8 | 3.30, 3.35 | 8.3 |

Inferring a chemical graph with target values for multiple properties. Once we obtain prediction functions $\eta_{\pi}$ for several properties $\pi$, we can include the MILP formulations of these functions $\eta_{\pi}$ in a single MILP $\mathcal{M}(x,y;\mathcal{C}_{1})$ so as to infer a chemical graph that satisfies given target values $y^{*}$ for all of these properties at the same time. As an additional experiment in Stage 4, we inferred a chemical graph that has a desired predicted value for each of the three properties Kow, Lp and Sl, using the prediction function $\eta_{\pi}$ for each property $\pi\in\{$Kow, Lp, Sl$\}$ constructed in Stage 3. Table 9 shows the result of Stage 4 for inferring a chemical graph $\mathbb{C}^{\dagger}$ from instances $I_{\mathrm{b}}^{2}$, $I_{\mathrm{b}}^{3}$ and $I_{\mathrm{b}}^{4}$ with $\Lambda=\left\{\mathrm{H},\mathrm{C},\mathrm{N},\mathrm{O},\mathrm{S}_{(2)},\mathrm{S}_{(6)},\mathrm{Cl}\right\}$, where we denote the following:

- $\pi$: one of the three properties Kow, Lp and Sl used in the experiment;

- $\underline{y}^{*}_{\pi},~{}\overline{y}^{*}_{\pi}$: lower and upper bounds $\underline{y}^{*}_{\pi},\overline{y}^{*}_{\pi}\in\mathbb{R}$ on the predicted property value $\eta_{\pi}(f(\mathbb{C}^{\dagger}))$ of property $\pi\in\{$Kow, Lp, Sl$\}$ for a chemical graph $\mathbb{C}^{\dagger}$ to be inferred;

- $\#$v (resp., $\#$c): the number of variables (resp., constraints) in the MILP in Stage 4;

- I-time: the time (sec.) to solve the MILP in Stage 4;

- $n$: the number $n(\mathbb{C}^{\dagger})$ of non-hydrogen atoms in the chemical graph $\mathbb{C}^{\dagger}$ inferred in Stage 4;

- $n^{\text{int }}$: the number $n^{\text{int }}(\mathbb{C}^{\dagger})$ of interior-vertices in the chemical graph $\mathbb{C}^{\dagger}$ inferred in Stage 4; and

- $\eta_{\pi}(f(\mathbb{C}^{\dagger}))$: the predicted property value $\eta_{\pi}(f(\mathbb{C}^{\dagger}))$ of property $\pi\in\{$Kow, Lp, Sl$\}$ for the chemical graph $\mathbb{C}^{\dagger}$ inferred in Stage 4.

Table 9. Results of Stage 4 for instances $I_{\mathrm{b}}^{i},i=2,3,4$ with specified target values of three properties Kow, Lp and Sl using Lasso linear regression.

| inst. | $\pi$ | $\underline{y}_{\pi}^{*},~{}\overline{y}_{\pi}^{*}$ | $\#$v | $\#$c | I-time | $n$ | $n^{\text{int}}$ | $\eta_{\pi}(f(\mathbb{C}^{\dagger}))$ |
|---|---|---|---|---|---|---|---|---|
| $I_{\mathrm{b}}^{2}$ | Kow | –7.50, –7.40 | 14574 | 11604 | 62.7 | 50 | 30 | –7.41 |
|  | Lp | –1.40, –1.30 |  |  |  |  |  | –1.33 |
|  | Sl | –11.6, –11.5 |  |  |  |  |  | –11.52 |
| $I_{\mathrm{b}}^{3}$ | Kow | –7.40, –7.30 | 14370 | 11596 | 35.5 | 48 | 25 | –7.38 |
|  | Lp | –2.90, –2.80 |  |  |  |  |  | –2.81 |
|  | Sl | –11.6, –11.4 |  |  |  |  |  | –11.52 |
| $I_{\mathrm{b}}^{4}$ | Kow | –7.50, –7.40 | 14166 | 11588 | 71.7 | 49 | 26 | –7.48 |
|  | Lp | –0.70, –0.60 |  |  |  |  |  | –0.63 |
|  | Sl | –11.4, –11.2 |  |  |  |  |  | –11.39 |

Fig. 8f illustrates the chemical graph $\mathbb{C}^{\dagger}$ inferred from $I_{\mathrm{b}}^{4}$ with $(\underline{y}^{*}_{\pi_{1}},\overline{y}^{*}_{\pi_{1}})=(-7.50,$$-7.40)$, $(\underline{y}^{*}_{\pi_{2}},\overline{y}^{*}_{\pi_{2}})=(-0.70,-0.60)$ and $(\underline{y}^{*}_{\pi_{3}},\overline{y}^{*}_{\pi_{3}})=(-11.4,-11.2)$ for $\pi_{1}=$Kow, $\pi_{2}=$Lp and $\pi_{3}=$Sl, respectively.
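Combining the MILPs of several prediction functions amounts to adding one linear constraint per property to the same feasibility problem. The following toy sketch uses hypothetical hyperplanes for two properties (not the actual functions for Kow, Lp and Sl), with scipy's `milp` as a stand-in solver:

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Hypothetical hyperplanes (w_pi, b_pi) for two properties over K = 3 descriptors.
w1, b1 = np.array([1.0, 2.0, -1.0]), 0.5   # property pi_1, target range [3, 4]
w2, b2 = np.array([1.0, 0.0, 1.0]), 0.0    # property pi_2, target range [2, 3]

# One linear constraint per property, all enforced simultaneously.
constraints = [
    LinearConstraint(w1[np.newaxis, :], 3.0 - b1, 4.0 - b1),
    LinearConstraint(w2[np.newaxis, :], 2.0 - b2, 3.0 - b2),
]

res = milp(
    c=np.zeros(3),              # feasibility problem: no objective
    integrality=np.ones(3),     # all descriptors integer in this toy example
    bounds=Bounds(0.0, 10.0),
    constraints=constraints,
)
x_star = res.x
```

Any feasible `x_star` simultaneously meets both target ranges, which is exactly how the single MILP in Table 9 handles three properties at once.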

Stage 5. We executed Stage 5 to generate more target chemical graphs $\mathbb{C}^{*}$, where a chemical graph $\mathbb{C}^{*}$ is called a chemical isomer of a target chemical graph $\mathbb{C}^{\dagger}$ of a topological specification $\sigma$ if $f(\mathbb{C}^{*})=f(\mathbb{C}^{\dagger})$ and $\mathbb{C}^{*}$ also satisfies the specification $\sigma$. We computed chemical isomers $\mathbb{C}^{*}$ of each target chemical graph $\mathbb{C}^{\dagger}$ inferred in Stage 4, generating up to 100 of them when the number of all chemical isomers exceeds 100. Such an algorithm can be obtained from the dynamic programming algorithm proposed by Tanaka et al. [25] with a slight modification. The algorithm first decomposes $\mathbb{C}^{\dagger}$ into a set of acyclic chemical graphs, next replaces each acyclic chemical graph $T$ with another acyclic chemical graph $T^{\prime}$ that admits the same feature vector as that of $T$, and finally assembles the resulting acyclic chemical graphs into a chemical isomer $\mathbb{C}^{*}$ of $\mathbb{C}^{\dagger}$. Moreover, the algorithm can compute a lower bound on the total number of chemical isomers of $\mathbb{C}^{\dagger}$ without generating all of them.
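The counting idea can be illustrated with a toy sketch (this is not the authors' dynamic programming; the fragment names and alternative sets are hypothetical): if each acyclic fragment of $\mathbb{C}^{\dagger}$ has a known set of interchangeable acyclic graphs with the same feature vector, then the product of the set sizes lower-bounds the number of chemical isomers without enumerating them:

```python
from itertools import islice, product

# Hypothetical decomposition: fragment -> acyclic graphs sharing its feature vector.
alternatives = {
    "T1": ["T1", "T1a", "T1b"],
    "T2": ["T2"],
    "T3": ["T3", "T3a"],
}

# Lower bound on the number of chemical isomers: a product of counts,
# computed without enumerating any isomer explicitly.
lower_bound = 1
for alts in alternatives.values():
    lower_bound *= len(alts)

# Explicitly generate at most 100 isomers by combining replacements.
isomers = list(islice(product(*alternatives.values()), 100))
```

Here the bound is $3\times 1\times 2=6$; fragments with a single alternative (like "T2" above) are what cause the $\#\mathbb{C}=1$ rows in Tables 2,3,4,5,6.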

Tables 2,3,4,5,6 show the computational results of the experiment in Stage 5 for the five properties, where we denote the following:

- D-time: the running time (sec.) to execute the dynamic programming algorithm in Stage 5 to compute a lower bound on the number of all chemical isomers $\mathbb{C}^{*}$ of $\mathbb{C}^{\dagger}$ and generate all (or up to 100) chemical isomers $\mathbb{C}^{*}$;

- $\mathbb{C}$-LB: a lower bound on the number of all chemical isomers $\mathbb{C}^{*}$ of $\mathbb{C}^{\dagger}$; and

- $\#\mathbb{C}$: the number of all (or up to 100) chemical isomers $\mathbb{C}^{*}$ of $\mathbb{C}^{\dagger}$ generated in Stage 5.

From Tables 2,3,4,5,6, we observe that in many cases the running time for generating up to 100 target chemical graphs in Stage 5 is less than 0.4 seconds. For some chemical graphs $\mathbb{C}^{\dagger}$, no chemical isomer was found by our algorithm. This is because each acyclic chemical graph in the decomposition of $\mathbb{C}^{\dagger}$ admits no alternative acyclic chemical graph with the same feature vector. On the other hand, some chemical graphs $\mathbb{C}^{\dagger}$, such as the one inferred from $I_{\mathrm{d}}$ in Table 2, admit an extremely large number of chemical isomers $\mathbb{C}^{*}$. Recall that the lower bound $\mathbb{C}$-LB on the number of chemical isomers is computed without generating all of them.

6. Conclusions

In this paper, we studied the problem of inferring chemical structures from desired chemical properties and constraints, based on the framework proposed and developed in [18, 19, 20]. In previous applications of this framework, artificial neural networks (ANNs) and decision trees were used for the machine learning in Stage 3. In this paper, we used linear regression in Stage 3 for the first time and derived an MILP formulation that simulates the computation process of linear regression. We also extended the way a target value $y^{*}$ for a property is specified, so that the predicted value $\eta(f(\mathbb{C}^{\dagger}))$ of a target chemical graph $\mathbb{C}^{\dagger}$ is required to belong to an interval between two specified values $\underline{y}^{*}$ and $\overline{y}^{*}$. Furthermore, we modified our model of chemical compounds so that multi-valence chemical elements, cations and anions are treated, and introduced the rank and the adjacency-configuration of leaf-edges as new descriptors in the feature vector of a chemical graph.

We implemented the new system of the framework and conducted computational experiments for Stages 1 to 5. We found 18 properties for which linear regression delivers a relatively good prediction function by using our feature vector based on the two-layered model. We also observed that an MILP formulation for inferring a chemical graph in Stage 4 can be solved efficiently over different types of test instances with complicated topological specifications. The experimental result suggests that our method can infer chemical graphs with up to 50 non-hydrogen atoms. Therefore, combination of linear regression and integer programming is a potentially useful approach to computational molecular design.

It is an interesting direction for future work to use other learning methods, such as graph convolutional networks, random forests and ensemble methods, to construct a prediction function and to derive the corresponding MILP formulations in Stages 3 and 4 of the framework.

Author Contributions

Conceptualization, HN and TA; methodology, HN; software, JZ, NAA and KH; validation, JZ, NAA and HN; formal analysis, HN; data resources, KH, LZ, HN and TA; writing—original draft preparation, HN; writing—review and editing, NAA and TA; project administration, HN; funding acquisition, TA. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript.

Ethics Approval and Consent to Participate

Not applicable.

Acknowledgment

Not applicable.

Funding

This research was supported, in part, by Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Conflict of Interest

The authors declare no conflict of interest. TA is serving as the guest editor of this journal. We declare that TA had no involvement in the peer review of this article and has no access to information regarding its peer review. Full responsibility for the editorial process for this article was delegated to AK and GP.

Publisher’s Note: IMR Press stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.