## Abstract

We introduce a novel method for Gaussian process (GP) modeling of massive datasets called globally approximate Gaussian process (GAGP). Unlike most large-scale supervised learners such as neural networks and trees, GAGP is easy to fit and can interpret the model behavior, making it particularly useful in engineering design with big data. The key idea of GAGP is to build a collection of independent GPs that use the same hyperparameters but randomly distribute the entire training dataset among themselves. This is based on our observation that the GP hyperparameter approximations change negligibly as the size of the training data exceeds a certain level, which can be estimated systematically. For inference, the predictions from all GPs in the collection are pooled, allowing the entire training dataset to be efficiently exploited for prediction. Through analytical examples, we demonstrate that GAGP achieves very high predictive power matching (and in some cases exceeding) that of state-of-the-art supervised learning methods. We illustrate the application of GAGP in engineering design with a problem on data-driven metamaterials, using it to link reduced-dimension geometrical descriptors of unit cells and their properties. Searching for new unit cell designs with desired properties is then achieved by employing GAGP in inverse optimization.

## 1 Introduction

Fueled by recent advancements in high-performance computing as well as data acquisition and storage capabilities (e.g., online repositories), data-driven methods are increasingly employed in engineering design [1–3] to efficiently explore the design space of complex systems by obviating the need for expensive experiments or simulations. For emerging material systems, in particular, large datasets have been successfully leveraged to design heterogeneous materials [4–8] and mechanical metamaterials [9–12].

Key to data-driven design is to develop supervised learners that can distill as much useful information from massive datasets as possible. However, most large-scale learners such as deep neural networks (NNs) [13] and gradient boosted trees (GBT) [14] are difficult to interpret and hence less suitable for engineering design. Gaussian process (GP) models (also known as Kriging) have many attractive features that underpin their widespread use in engineering design. For example, GPs interpolate the data, have a natural and intuitive mechanism to smooth the data to address noise (i.e., to avoid interpolation) [15], and are very interpretable (i.e., provide insight into input–output relations) [16,17]. In addition, they quantify prediction uncertainty and have analytical conditional distributions that enable, e.g., tractable adaptive sampling or Bayesian analysis [18]. However, conventional GPs are not readily applicable to large datasets and have been mostly confined to engineering design with small data. The goal of our work is to bridge the gap between big data and GPs while achieving high predictive accuracy.

The difficulty in fitting GPs to big data is rooted in the repetitive inversion of the sample correlation matrix, ** R**, whose size equals the number of training samples,

*n*. Given the practical features and popularity of GPs, considerable effort has been devoted to resolving their scalability shortcoming. One avenue of research has explored partitioning the input space (and hence the training data) via, e.g., trees [19] or Voronoi cells [20], and fitting an independent GP to each partition. While particularly useful for small to relatively large datasets that exhibit the nonstationary behavior, prediction using these methods results in discontinuity (at the partitions’ boundaries) and information loss (because the query point is associated with only one partition). Projected process approximation (PPA) [21] is another method where the information from

*n*samples is distilled into

*m*≪

*n*randomly (or sequentially) selected samples through conditional distributions. PPA is very sensitive to the

*m*selected samples, however, and overestimates the variance [21]. In Bayesian committee machine (BCM) [22], the dataset is partitioned into

*p*mutually exclusive and collectively exhaustive parts with independent GP priors, and then, the predictions from all the GPs are pooled together in a Bayesian setting. While theoretically very attractive, BCM does not scale well with the dataset size and is computationally very expensive.

Another avenue of research has pursued subset selection. For example, a simple strategy is to only use *m* ≪ *n* samples to train a GP [23,24], where the *m* samples are selected either randomly or sequentially based on maximizing some criteria such as information gain or differential entropy score. Reduced-rank approximation of ** R** with

*m*≪

*n*samples is another option for subset selection and has been used in the Nystrom [25] and subset of regressors [26,27] methods. The

*m*samples in these methods are chosen randomly or in a greedy fashion to minimize some cost function. While the many variants of subset selection may be useful in some applications, they waste information and are not applicable to very large datasets due to the computational and storage costs. Local methods also use subsets of the data because they fit a stationary GP (for each prediction) to a very small number of training data points that are closest to the query point. Locally approximate Gaussian process (LAGP) [28] is perhaps the most widely recognized local method where the subsets are selected either based on their proximity to the query point or to minimize the predictive variance. Despite being useful for nonstationary and relatively large datasets, local methods also waste some information and can be prohibitively expensive for repetitive use since local samples have to be found and a GP must be fitted for each prediction.

Although the recent works have made significant progress in bridging the gap between GPs and big data, GPs still struggle to achieve the accuracy of the state-of-the-art large-scale supervised learners such as NNs and trees. Motivated by this limitation, we develop a computationally stable and inexpensive approach for GP modeling of massive datasets. The main idea of our approach is to build a collection of independent GPs that utilize a converged roughness parameter as their hyperparameters. This is based on an empirical observation that the estimates of the GP hyperparameters change negligibly as the size of the training data exceeds a certain level. While having some common aspects with a few of the abovementioned works, our method is more massively scalable, can leverage multicore or graphical processing unit computations [29,30], and is applicable to very high-dimensional data with or without noise.

As mentioned earlier, big data have enticed new design methods for complex systems such as metamaterials [9–12], which possess superior properties through their hierarchical structure that consists of repeated unit cells. While traditional methods like topology optimization (TO) provide a systematic computational platform to find metamaterials with unprecedented properties, they have many challenges that are primarily due to the high-dimensional design space (i.e., the geometry of unit cells), computational costs, local optimality, and spatial discontinuities across unit cell boundaries (when multiple unit cells are simultaneously designed). Techniques for TO such as varying the volume fraction or size of one unit cell to maintain continuous boundaries [31,32], adding connectivity constraints [33], and substructuring [34] have recently been proposed but cannot fully address all of the above challenges. Instead, we take a data-driven approach by first building a large training database of many unit cells and their corresponding properties. Unlike previous data-driven works that represent unit cells as signed distance fields [9] or voxels [11], we drastically reduce the input dimension in our dataset by characterizing the unit cells via spectral shape descriptors based on the Laplace–Beltrami (LB) operator. Then, we employ our globally approximate Gaussian process (GAGP) modeling approach to link the LB descriptors of unit cells to their properties and, in turn, efficiently discover new unit cells with desired properties.

The rest of the paper is organized as follows. We first review some preliminaries on GP modeling in Sec. 2 and then introduce our novel idea in Sec. 3. In Sec. 4, we validate the accuracy of our approach by comparing its performance against three popular and large-scale supervised learning methods on five analytical problems. We demonstrate an application of GAGP to our data-driven design method for metamaterials in Sec. 5 and conclude the paper in Sec. 6.

## 2 Review on Gaussian Process Modeling

*y*and the

*d*dimensional vector

**= [**

*x**x*

_{(1)},

*x*

_{(2)}, …,

*x*

_{(d)}]

^{T}, where

**∈ ℝ**

*x*^{d}. Assume the input–output relation is a realization of the random process

*η*(

**):**

*x**f*

_{i}(

**)’s are some predetermined set of basis functions,**

*x***= [**

*β**β*

_{(1)}, ….,

*β*

_{(h)}]

^{T}are unknown weights, and

*ξ*(

**) is a zero-mean GP characterized with its parametric covariance function,**

*x**c*( · , · ):

*r*( · ) is the correlation function having the property

*r*(

**) = 1 and**

*x*,*x**σ*

^{2}is the process variance. Various correlation functions have been developed in the literature, with the most widely used one being the Gaussian correlation function:

**= [**

*ω**ω*

_{(1)},

*ω*

_{(2)}, …,

*ω*

_{(d)}]

^{T}, − ∞ <

*ω*

_{i}< ∞ are the roughness or scale parameters. The collection of

*σ*

^{2}and

**are called the hyperparameters.**

*ω*With the formulation in Eq. (1) and given the *n* training pairs of (*x*_{i}, *y*_{i}), GP modeling requires finding a point estimate for ** β**,

**, and**

*ω**σ*

^{2}via either maximum likelihood estimation (MLE) or cross-validation (CV). Alternatively, Bayes’ rule can be employed to find the posterior distributions if there is prior knowledge on these parameters. Herein, we use a constant process mean (i.e., $\u2211i=1h\u2061\beta ifi(x)=\beta $) and employ MLE. These choices are widely practiced because a high predictive power is provided while computational costs are minimized [28,35–39].

**1**is an

*n*× 1 vector of ones, and

**is the**

*R**n*×

*n*correlation matrix with (

*i*,

*j*)th element

*R*

_{ij}=

*r*(

*x*_{i},

*x*_{j}) for

*i*,

*j*= 1, …,

*n*. Setting the partial derivatives with respect to

*β*and

*σ*

^{2}to zero yields:

By numerically minimizing *L* in Eq. (7), one can find $\omega ^$. Many global optimization methods such as genetic algorithm (GA) [40], pattern searches [41,42], and particle swarm optimization [43] have been employed to solve for $\omega ^$ in Eq. (7). However, gradient-based optimization techniques are commonly preferred due to their ease of implementation and superior computational efficiency [15,16,35]. To guarantee global optimality in this case, the optimization is done numerous times with different initial guesses. It is noted that, in practice, the search space of *ω*_{i} is generally limited to [− 20, 5] rather than ( − ∞, ∞) since the correlation exponentially changes as a function of *ω*_{i} (see also Fig. 3).

*x*^{*}:

*n*× 1 vector with

*i*th element $c(xi,x*)=\sigma ^2r(xi,x*)$,

**is the covariance matrix with (**

*V**i*,

*j*)th element $\sigma ^2r(xi,xj)$, and

**= [**

*y**y*

_{1}, …,

*y*

_{n}]

^{T}are the responses in the training dataset. The posterior covariance between the responses at $x*$ and

**′ reads:**

*x***= (**

*h***1**−

**1**

^{T}

*V*^{−1}

**(**

*g***′)).**

*x**q*responses by placing a constant mean for each response (i.e.,

**= [**

*β**β*

_{(1)}, ….,

*β*

_{(q)}]

^{T}) and employing the separable covariance function:

*q*×

*q*process covariance matrix with its off-diagonal elements representing the covariance between the corresponding responses at any fixed

**. The MLE approach described above can also be applied to multiresponse datasets in which case**

*x**σ*

^{2}will be replaced with $\Sigma $ (see Refs. [45–48] for details).

Finally, we note that GPs can address noise and smooth the data (i.e., avoid interpolation) via the so-called nugget or jitter parameter, *δ*, in which case ** R** is replaced with $R\delta =R+\delta In\xd7n$. If

*δ*is used, the estimated (stationary) noise variance in the data would be $\delta \sigma ^2$. We have recently developed an automatic method to robustly detect and estimate noise [35].

## 3 Globally Approximate Gaussian Process

Regardless of the optimization method used to solve for $\omega ^$, each evaluation of *L* in Eq. (7) requires inverting the *n* × *n* matrix ** R**. For very large

*n*, there are two main challenges associated with this inversion: computational cost of approximately

*O*(

*αn*

^{3}) and singularity of

**(since the samples get closer as**

*R**n*increases). To address these issues and enable GP modeling of big data, our essential idea is to build a collection of independent GPs that use the same $\omega ^$ and share the training data among themselves.

To illustrate, we consider the function *y* = *x*^{4} − *x*^{3} − 7*x*^{2} + 3*x* + 5sin(5*x*) over −2 ≤ *x* ≤ 3. The associated likelihood profile (i.e., *L*) is visualized in Fig. 1 as a function of *ω* for various values of *n*. Two interesting phenomena are observed in this figure: (i) With large *n*, the profile of *L* does not alter as the training samples change. To observe this, for each *n*, we generate five independent training samples via Sobol sequence [49,50] and plot the corresponding *L*. As illustrated in Fig. 1, even though a total of 20 curves are plotted, only four are visible since the five curves with the same *n* are indistinguishable. (ii) As *n* increases, *L* is minimized at similar *ω*’s.

While we visualize the above two points with a simple 1D function, our studies indicate that they hold in general (i.e., irrespective of problem dimensionality and the absence or presence of noise; see Sec. 4) as long as the number of training samples is large. Therefore, we propose the following approach for GP modeling of large datasets.

Assuming a very large training dataset of size *n* is available, we first randomly select a relatively small subset of size *n*_{0} (e.g., *n*_{0} = 500) and estimate $\omega ^0$ with a gradient-based optimization technique. Then, we add *n*_{s} random samples (e.g., *n*_{s} = 250) to this subset and estimate $\omega ^1$ while employing $\omega ^0$ as the initial guess in the optimization. This process is stopped after *s* steps when $\omega ^$ does not change noticeably (i.e., $\omega ^s\u2245\omega ^s\u22121$) as more training data are used. The latest solution, denoted by $\omega ^\u221e$, is then employed to build *m* GP models, each with *n*_{k} ≥ *n*_{0} + *s* × *n*_{s} samples chosen randomly from the entire training data such that $n=\u2211k=1m\u2061nk$. Here, we have assumed that the collection of these GPs (who have $\omega ^\u221e$ as their hyperparameters) approximate a GP that is fitted to the entire training dataset and, correspondingly, call it GAGP. The algorithm of GAGP is presented in Fig. 2.

We point out the following important features regarding GAGP. First, we recommend using gradient-based optimizations throughout the entire process because (i) if *n*_{0} is large enough (e.g., *n*_{0} > 500), one would need to select only a few initial guesses to find the global minimizer of Eq. (7), i.e., $\omega ^0$ (we suggest the method developed in Ref. [35] to estimate $\omega ^0$); and (ii) we want to use $\omega ^i\u22121$ as the initial guess for the optimization in the *i*th step to ensure fast convergence since the minimizer of *L* changes slightly as the dataset size increases (see Fig. 1). Regarding the choice on *n*_{0}, note that it has to be small enough to avoid prohibitive computational time but large enough, so that (i) the global optimum changes slightly if *n*_{0} + *s* data points are used instead of *n*_{0} data points, and (ii) most (if not all) of the local optima of *L* are smoothed out. Second, for predicting the response, Eq. (8) is used for each of the *m* GP models, and then, the results are averaged. In our experience, we observe very similar prediction results with different averaging (e.g., weighted averaging where the weights are proportional to inverse variance) or pooling (e.g., median) schemes. The best scheme for a particular problem can be found via CV, but we avoid this step to ensure ease of use and generality. The advantages of employing a collection of models (in our case the *m* GPs) in prediction are extensively demonstrated in the literature [14,22]. Third, the predictive power is not sensitive to *n*_{0}, *s*, and *n*_{s} so long as large enough values are used for them. For novice users, we recommend starting with *n*_{0} = 500, *s* = 6, and *n*_{s} = 250, and equally distributing the samples among the *m* resulting GPs (we use these parameters in Sec. 5 and for all the examples in Sec. 4). For more experienced users, we provide a systematic way in Sec. 4 to choose these values based on GP’s inherent ability to estimate noise using the nugget variance. Finally, we point out that GAGP has a high predictive power and is applicable to very large datasets while maintaining a straightforward implementation because it only entails integrating a GP modeling package such as GPM [35] with the algorithm presented in Fig. 2.

## 4 Comparative Studies on Analytical Examples

*x*) term. Ex2 is a 4D function that is primarily sensitive to the first two inputs. Ex3 is a 7D function that models the cycle time of a piston, which mainly depends on the first and fourth inputs. Ex4 is a very complex 10D function relating the stored potential in a material to the applied load and material properties; the inputs interact nonlinearly and all affect the response. Finally, Ex5 is a 20D function where the output is independent of

*x*

_{8}and

*x*

_{16}.

- Ex1:(11)$y=x4\u2212x3\u22127x2+3x+5sin\u2061(5x)\u22122\u2264x\u22643$
- Ex2 [52]:(12)$y=(1\u2212exp\u2061(\u221212x2))(x3x13+1900x12+2092x1+60)x4x13+500x12+4x1+20min(x)=[0,0,2200,85]max(x)=[1,1,2400,110]$
- Ex3 [53]:(13)$y=2\pi x1x4+4x3x4x5x6(x5x2+19.62x1\u2212x4x3x2)2+4x4x5(x6x7)x7min(x)=[30,0.005,0.002,1000,9\xd7104,290,340]max(x)=[60,0.02,0.01,5000,11\xd7104,296,360]$
- Ex4 [54]:(14)$y=92x9\epsilon m2+x8x101+x7[\epsilon eqx10]1+x7\epsilon =(x1x6x5x6x2x4x5x4x3),\epsilon m=13Tr(\epsilon ),\epsilon d=\epsilon \u2212\epsilon m1,\epsilon eq=23(\epsilon d:\epsilon d)min(x)=[\u22120.1,\u22120.1,\u22120.1,\u22120.1,\u22120.1,\u22120.1,0.02,1,5,0.1]max(x)=[0.1,0.1,0.1,0.1,0.1,0.1,0.1,5,25,1.5]$

For each example, two independent and unique datasets of size 30,000 are generated with Sobol sequence [50], where the first one is used for training and the second for validation. In each example, Gaussian noise is added to both the training and validation outputs. We consider two noise levels to test the sensitivity of the results where the noise standard deviation (SD) is determined based on each example’s output range (e.g., the outputs in Ex1 and Ex4 fall in the [−20, 5] and [0, 1.8] ranges, respectively). As we measure performance by root mean squared error (RMSE), the noise SD should be recovered on the validation dataset (i.e., the RMSE would ideally equal noise SD).

We use CV to ensure the best performance is achieved for LAGP, GBT, and NN. For GAGP, we choose *n*_{0} = 500, *s* = 6, and *n*_{s} = 250 and equally distribute the samples among the $m=30000/(500+6\xd7250)=15$ GPs (i.e., each GP has 2000 samples). The results are summarized in Table 1 (for small noise SD) and Table 2 (for large noise SD) and indicate that (i) GAGP consistently outperforms LAGP and GBT, (ii) both GAGP and NN recover the true amount of added noise with high accuracy, and (iii) GAGP achieves very similar results to NN. Given the large number of data points, the effect of sample-to-sample randomness on the results is very small and hence not reported.

Noise SD | LAGP | GBT | NN | GAGP | |
---|---|---|---|---|---|

Ex1 (1D) | 0.2 | 1.271 | 0.209 | 0.200 | 0.200 |

Ex2 (4D) | 0.1 | 1.386 | 0.121 | 0.100 | 0.103 |

Ex3 (7D) | 0.1 | 0.129 | 0.118 | 0.100 | 0.100 |

Ex4 (10D) | 0.01 | 0.210 | 0.048 | 0.012 | 0.011 |

Ex5 (20D) | 0.1 | 1.450 | 0.351 | 0.101 | 0.103 |

Noise SD | LAGP | GBT | NN | GAGP | |
---|---|---|---|---|---|

Ex1 (1D) | 0.2 | 1.271 | 0.209 | 0.200 | 0.200 |

Ex2 (4D) | 0.1 | 1.386 | 0.121 | 0.100 | 0.103 |

Ex3 (7D) | 0.1 | 0.129 | 0.118 | 0.100 | 0.100 |

Ex4 (10D) | 0.01 | 0.210 | 0.048 | 0.012 | 0.011 |

Ex5 (20D) | 0.1 | 1.450 | 0.351 | 0.101 | 0.103 |

Note: Smallest errors are in bold.

Noise SD | LAGP | GBT | NN | GAGP | |
---|---|---|---|---|---|

Ex1 (1D) | 2 | 2.270 | 2.062 | 2.000 | 2.000 |

Ex2 (4D) | 1 | 1.739 | 1.123 | 1.002 | 1.009 |

Ex3 (7D) | 1 | 1.037 | 1.098 | 1.002 | 1.002 |

Ex4 (10D) | 0.1 | 0.234 | 0.120 | 0.102 | 0.102 |

Ex4 (20D) | 1 | 1.911 | 1.155 | 1.011 | 1.001 |

Noise SD | LAGP | GBT | NN | GAGP | |
---|---|---|---|---|---|

Ex1 (1D) | 2 | 2.270 | 2.062 | 2.000 | 2.000 |

Ex2 (4D) | 1 | 1.739 | 1.123 | 1.002 | 1.009 |

Ex3 (7D) | 1 | 1.037 | 1.098 | 1.002 | 1.002 |

Ex4 (10D) | 0.1 | 0.234 | 0.120 | 0.102 | 0.102 |

Ex4 (20D) | 1 | 1.911 | 1.155 | 1.011 | 1.001 |

Note: Smallest errors are in bold.

We highlight that the performance of GAGP in each case could have been improved even further by tuning its parameters via CV (which was done for LAGP, GBT, and NN). Potential parameters include *n*_{0}, *s*, *n*_{s}, and *f*_{i}(** x**). However, we

*intentionally*avoid this tuning to demonstrate GAGP’s flexibility, generality, and ease of use.

In engineering design, it is highly desirable to employ interpretable methods and tools that facilitate the knowledge discovery and decision-making processes. Contrary to many supervised learning techniques such as NNs and random forests that are black boxes, the structure of GPs can provide qualitative insights. To demonstrate, we rewrite Eq. (3) as $r(x,x\u2032)=exp\u2061{\u2212\u2211i=1d\u206110\omega i(x(i)\u2212x(i)\u2032)2}$. If *ω*_{i} ≪ 0 (e.g., *ω*_{i} = −10), then variations along the *i*th dimension (i.e., *x*_{(i)}) do not contribute to the summation and, subsequently, to the correlation between ** x** and

**′ (see Fig. 3 for a 1D illustration). This contribution increases as the magnitude of**

*x**ω*

_{i}increases. In a GP with a constant mean of

*β*, all the effect of inputs on the output is captured through

*r*(

**,**

*x***′). Hence, as**

*x**ω*

_{i}decreases, the effect of

*x*

_{i}on the output decreases as well. We illustrate this feature with a 2D example as follows. Assume $y=f(x1,x2;\alpha )=sin\u2061(2x1x2)+\alpha cos\u2061(\alpha x12)$, −

*π*≤

*x*

_{1},

*x*

_{2}≤

*π*for

*α*= 2, 4, 6. Three points regarding

*f*are highlighted:

*x*_{1}is more important than*x*_{2}since both sin(2*x*_{1}*x*_{2}) and $\alpha cos\u2061(\alpha x12)$ depend on*x*_{1}(note that*α*≠ 0), while*x*_{2}only affects the first term.As

*α*increases, the relative importance of*x*_{1}(compared with*x*_{2}) increases because the amplitude of $\alpha cos\u2061(\alpha x12)$ increases.As

*α*increases,*y*depends on*x*_{1}with growing nonlinearity because the frequency of $\alpha cos\u2061(\alpha x12)$ increases.

The first two points can be verified by calculating Sobol’s total sensitivity indices (SIs) for *x*_{1} and *x*_{2} in *f*; see Table 3. These indices range from 0 to 1, with higher values indicating more sensitivity to the input. Here, the SI of *x*_{1} is always 1, but the SI of *x*_{2} decreases as *α* increases, indicating that the *relative* importance of *x*_{1} on *y* increases as *α* increases.

α = 2 | α = 4 | α = 6 | ||
---|---|---|---|---|

n = 1000 | $\omega ^1$ | 2.39 | 3.11 | 3.59 |

$\omega ^2$ | − 1.98 | − 2.12 | − 2.11 | |

n = 2000 | $\omega ^1$ | 2.38 | 3.10 | 3.54 |

$\omega ^2$ | − 1.92 | − 2.18 | − 2.22 | |

Total SI | x_{1} | 1.00 | 1.00 | 1.00 |

x_{2} | 0.18 | 0.05 | 0.03 |

α = 2 | α = 4 | α = 6 | ||
---|---|---|---|---|

n = 1000 | $\omega ^1$ | 2.39 | 3.11 | 3.59 |

$\omega ^2$ | − 1.98 | − 2.12 | − 2.11 | |

n = 2000 | $\omega ^1$ | 2.38 | 3.10 | 3.54 |

$\omega ^2$ | − 1.92 | − 2.18 | − 2.22 | |

Total SI | x_{1} | 1.00 | 1.00 | 1.00 |

x_{2} | 0.18 | 0.05 | 0.03 |

Note: The Sobol’s total sensitivity indices (SIs) are also included.

Note that calculating the Sobol’s SIs involves evaluating *f* for hundreds of thousands of samples, while a GP can distill similar sensitivities from a dataset. To show this, for each *α*, we fit two GPs: one with *n* = 1000 training data and the other with *n* = 2000. The hyperparameter estimates are summarized in Table 3 and indicate that:

For each

*α*, $\omega ^1$ is larger than $\omega ^2$, implying*x*_{1}is more important than*x*_{2}.As

*α*increases, $\omega ^1$ increases (*x*_{1}becomes more important), while $\omega ^2$ changes negligibly (the underlying functional relation between*x*_{2}and*y*does not depend on*α*).For a given

*α*, the estimates change insignificantly when*n*is increased.

The above feature is present in GAGP as well and depicted by the convergence histories for Ex3 and Ex5 in Figs. 4 and 5, respectively. Similar to Fig. 1, it is evident that the estimated roughness parameters do not change noticeably as more samples are used in training (only 6 of the 20 roughness parameters are plotted in Fig. 5 for a clearer illustration). The values of these parameters can determine which inputs (and to what extent) affect the output. For instance, in Ex5, *ω*_{8} is very small so the output must be almost insensitive to *x*_{8}. In addition, since *ω*_{4} ≅ *ω*_{20}, it is expected that the corresponding inputs should affect *y* similarly. These observations agree with the analytical relation between ** x** and

*y*in Ex5, where

*y*is independent of

*x*

_{8}and symmetric with respect to

*x*

_{4}and

*x*

_{20}. By using GAGP, such information can also be extracted from a training dataset whose underlying functional relation is unknown and subsequently used for sensitivity analysis or dimensionality reduction (e.g., in Ex5,

*x*

_{8}and

*x*

_{16}can be excluded from the training data).

In Figs. 4 and 5, the estimated variance, $\delta \sigma ^2$, varies closely around the true noise variance. It provides a useful quantitative measure for the expected predictive power (e.g., RMSE in future uses of the model). In addition, like $\omega ^$, its convergence history helps in determining whether sufficient samples have been used in training. First, the number of training samples should be increased until $\delta \sigma ^2$ does not fluctuate noticeably. Second, via *k*-fold CV during training, the true noise variance should ideally be recovered by calculating the RMSE associated with predicting the samples in the *i*th fold (when fold *i* is not used in training). If these two values differ significantly, *s* (or *n*_{s}) should be increased. For instance, if the fluctuations on the right panel in Fig. 5 had been large or far from the noise variance, we would have increased *s* (from 6 to, e.g., 10) or *n*_{s} (from 250 to, e.g., 500).

We close this section with some theoretical discussions related to the use of the same hyperparameters within a collection of independent GPs. In Bayesian experimental design [55], multidimensional integrals ubiquitously arise when maximizing the expected information gain, i.e., *E*[*I*]. By quantifying *I* using Kullback–Leibler divergence [56], it can be shown [57] that $E[I]=\u222b(\u222blog\u2061(p(\theta |yi)p(\theta ))p(\theta |yi)d\theta )p(yi)dyi$, where ** θ** are the hyperparameters (to be estimated),

*y*_{i}are the observables in the

*i*th experiment, and

*p*( · ) is the probability density function. The nested integral renders maximizing

*E*[

*I*] prohibitively expensive. To address this and facilitate integration, Laplace’s theorem is used to approximate

*p*(

**|**

*θ*

*y*_{i}) via a multivariate Gaussian likelihood or log likelihood, such as in Eq. (4). Following the central limit theorem, the accuracy of this approximation increases as the number of data points increases since the likelihood would more closely resemble a unimodal multivariate Gaussian curve [58]. With GAGP, we essentially make the same approximation, i.e., Eq. (4) approximates a unimodal multivariate Gaussian curve in the log space whose minimizer insignificantly changes when the training data are massive (note that the function value,

*L*, does change; see Fig. 1).

## 5 Data-Driven Design of Metamaterials

To demonstrate the application of GAGP in engineering design, we employ it in a new data-driven method for the optimization of metamaterial unit cells using big data. Although various methods, e.g., TO and GA, have been applied to design metamaterials with prescribed properties, these are computationally intensive and suffer from sensitivity to the initial guess as well as disconnected boundaries when using multiple unit cells. A promising solution is to construct a large database of precomputed unit cells (also known as microstructures or building blocks) for efficient selection of well-connected unit cells from the database as well as inexpensive optimization of new unit cells [9–12]. However, with the exception of Ref. [12] where unit cells are parameterized via geometric features like beam thickness, research in this area thus far use high-dimensional geometric representations (e.g., signed distance functions [9] or voxels [11]) that increase the memory demand and the complexity of constructing machine learning models that link structures to properties. Therefore, reducing the dimension of the unit cell is a crucial step.

In this work, we reduce the dimension of the unit cells in our metamaterial database with spectral shape descriptors based on the LB operator. We then employ GAGP to learn how the effective stiffness tensor of unit cells changes as a function of their LB descriptors. After the GAGP model is fitted, we use it to discover unit cells with desired properties through inverse optimization. Furthermore, to present the advantages of a large unit cell database and GAGP, we compare the results with those obtained using an NN model fitted to the same dataset and a conventional GP model fitted to a smaller dataset.

### 5.1 Metamaterials Database Generation.

We propose a novel two-stage pipeline inspired by Ref. [11] to generate a large training dataset of unit cells and their corresponding homogenized properties. For demonstration, our primary properties of interest are the components of the stiffness tensor, *E*_{x}, *E*_{y}, and *E*_{xy}. As elaborated below, our method starts by building an initial dataset and then proceeds to efficiently cover the input (geometry) and output (property) spaces as widely as possible.

To construct the initial dataset in stage one, we select design targets in the property space (the 3D space spanned by *E*_{x}, *E*_{y}, and *E*_{xy}). As the bounds of the property space are unknown a priori, we sample 1000 points uniformly distributed in [0, 1]^{3}. Then, we use the solid isotropic material with penalization TO method [59] to find the orthotropic unit cells corresponding to each target. This stage generates 358 valid structures. The remaining 642 points do not result in feasible unit cells mainly because (i) the uniform sampling places some design targets in theoretically infeasible regions and (ii) the TO method may fail to meet targets due to sensitivity to the initial shape, which is difficult to guess without prior knowledge. The properties of these 358 structures are shown in Fig. 6, where the Poisson’s ratio is used instead of *E*_{xy} for a better illustration of the space.

*x*

_{new}and

*x*

_{old}are the coordinates of the new and original pixel locations,

*x*

_{c}is the coordinate vector of the distortion center,

*r*

_{new}and

*r*

_{old}are the new and original distances to the distortion center, respectively, and

*R*

_{0}is the outer distortion radius.

*r*

_{new}can be expressed as follows:

*R*

_{0},

*γ*, and

*x*

_{c}) have clear interpretations and hence can be easily tuned. In our case, they are all set as random variables with standard uniform distributions to generate a wide range of structures. Second, it preserves the topology of the original unit cell and introduces negligible artifacts (e.g., disconnections and checkerboard patterns) upon perturbation.

*d*is the Euclidean (L2) distance between the stiffness tensor components of each unit cell to the boundaries of the region enclosing the current property space (see the shaded areas in Figs. 6 and 7),

*ρ*is the number of data points inside the neighborhood within a given radius in the property space (in our experience, sampling is more uniform when the radius equals 0.05), and

*ɛ*≪ 1 is used to avoid singularity. Then, we select the

*N*points with the highest scores for stochastic perturbation. After each iteration, newly generated unit cells are verified and discarded if they contain infeasible features such as isolated parts. The properties of new feasible structures are then calculated via numerical homogenization [60] and added to the dataset. The perturbation is repeated until the boundary of property space does not expand significantly (Δ

*d*< 0.1), and the points inside the boundaries are relatively dense (average

*ρ*> 500). By using this strategy, the database is expanded in our test case from 358 to 88,000 unit cells that cover a wider range of properties (see Fig. 7). We note that the machine learner (GAGP in our case) may also be used to detect the infeasible structures; we do not consider this in the current work but may include it in our future studies.

### 5.2 Unit Cell Dimension Reduction via Spectral Shape Descriptors.

In the previous section, each unit cell in the database is represented by 50 × 50 pixels. For dimension reduction, we use spectral shape descriptors as they retain geometric and physical information. Specifically, we use the LB spectrum, also known as Shape-DNA, which can be directly calculated for any unit cell shape [61,62].

The LB spectrum is an effective descriptor for the metamaterials database for several reasons: (i) It has a powerful discrimination ability and has been successfully applied to shape matching and classification in computer vision despite being one of the simplest spectral descriptors. (ii) All of the complex structures in our orthotropic metamaterials database can be uniquely characterized with the first 10–15 eigenvalues in the LB spectrum. (iii) The spectrum embodies some geometrical information, including perimeter, area, and Euler number. This can be beneficial for the construction of the machine learning model as less training data may be required to obtain an accurate model compared with voxel- or point-based representations. (iv) Similar shapes have close LB spectrum, which may also help the supervised learning task.

*f*defined on a Riemannian manifold [61], the Helmholtz equation reads as follows:

*τ*are the interior and boundaries of the domain of interest, respectively.

Finally, the finite element method is employed to obtain the LB spectrum of unit cells [63]; see Fig. 8. It is noted that our 88,000 structures can be uniquely determined with only the first 16 non-zero eigenvalues, reducing the input dimension from 50 × 50 = 2500 pixels to 16 scalar descriptors. In general, the computation of the LB spectrum takes only a few seconds per unit cell on a single CPU (Intel(R) Xeon(R) Gold 6144 CPU at 3.50 GHz). Since these computations are performed once and in parallel, the runtime is acceptable.

### 5.3 Machine Learning: Linking LB Representation to Property via GAGP.

Once the dataset is built, we follow the algorithm in Fig. 2 for machine learning, i.e., relating the LB representations of unit cells to their stiffness tensor. We use the same fitting parameters as in Sec. 4 (*n*_{0} = 500, *s* = 6, *n*_{s} = 250), equally distribute the samples among the $m=88000/(500+6\xd7250)=44$ GPs, and use Eq. (10) to have a multiresponse model that leverages the correlation between the responses to have a higher predictive power. The convergence histories are provided in Fig. 9, where the trends are consistent with those in Sec. 4. It is observed that the 16 estimated roughness parameters do not change noticeably once more than 1000 samples are used in training. In particular, 3 of the 16 roughness estimates, which correspond to *λ*_{14}, *λ*_{15}, and *λ*_{16} are very small, indicating that those LB descriptors do not affect the responses. The next largest estimate belongs to *ω*_{13} ≅ −8, which corresponds to *λ*_{13}. The rest of the estimates are all between 2.5 and 3, implying that the first 12 eigenvalues (shape descriptors) affect the responses similarly and nonlinearly (since large *ω*_{i} indicates rough response changes along dimension *i*). These observations agree well with the fact that the higher order eigenvalues generally explain less variability in the data. The estimated noise variances (one per response) also converge, with *E*_{xy} having the largest estimated noise variance in the data, which is potentially due to larger numerical errors in property estimation.

To illustrate the effect of expanding the training data from 385 to 88,000, we randomly select 28,000 samples for validation. Then, we evaluate the mean squared error (MSE) of the following two models on this test set: a conventional GP fitted to the initial 385 samples and a GAGP fitted to the rest of the data (i.e., to 60,000 samples, resulting in $m=60,000/(500+6\xd7250)=30$ models). To account for randomness, we repeat this process 20 times. The results are summarized in Table 4 and demonstrate that (i) increasing the dataset size (stage two in Sec. 5.1) creates a supervised learner with a higher predictive power (compare the mean of MSEs for GP and GAGP). (ii) GAGP is more robust to variations than GP (compare the variance of MSEs for GP and GAGP). (iii) With 60,000samples, the predictive power of GAGP is slightly lower than the case where the entire dataset is used in training (compare mean of MSEs for GAGP in Table 4 with the converged noise estimates in Fig. 9).

Mean of MSE | Variance of MSE (× 10^{6}) | |||||
---|---|---|---|---|---|---|

E_{x} | E_{y} | E_{xy} | E_{x} | E_{y} | E_{xy} | |

GP | 0.048 | 0.007 | 0.028 | 39 | 5.5 | 0.45 |

GAGP | 0.008 | 0.001 | 0.011 | 0.12 | 0.0007 | 0.04 |

Mean of MSE | Variance of MSE (× 10^{6}) | |||||
---|---|---|---|---|---|---|

E_{x} | E_{y} | E_{xy} | E_{x} | E_{y} | E_{xy} | |

GP | 0.048 | 0.007 | 0.028 | 39 | 5.5 | 0.45 |

GAGP | 0.008 | 0.001 | 0.011 | 0.12 | 0.0007 | 0.04 |

Note: The mean and variance of MSE are calculated over 20 random repetitions.

To assess the robustness of GAGP to data size, we repeat the above procedure but with 20,000 and 40,000 training samples instead of 60,000. For fair comparisons, the same validation sample size of 28,000 is used for each. The results are summarized in Table 5 and, by comparing them with those of GAGP in Table 4, indicate that increasing the sample size from 20,000 to 60,000 increases the predictive power and robustness. Note that, since [*n*_{0}, *n*_{s}, *s*] are not changed when fitting GAGP, using more samples increases *m*, the number of GPs.

Sample size | Mean of MSE | Variance of MSE (× 10^{6}) | ||||
---|---|---|---|---|---|---|

E_{x} | E_{y} | E_{xy} | E_{x} | E_{y} | E_{xy} | |

20,000 | 0.0090 | 0.0024 | 0.017 | 0.30 | 0.004 | 0.08 |

40,000 | 0.0084 | 0.0016 | 0.013 | 0.16 | 0.001 | 0.05 |

Sample size | Mean of MSE | Variance of MSE (× 10^{6}) | ||||
---|---|---|---|---|---|---|

E_{x} | E_{y} | E_{xy} | E_{x} | E_{y} | E_{xy} | |

20,000 | 0.0090 | 0.0024 | 0.017 | 0.30 | 0.004 | 0.08 |

40,000 | 0.0084 | 0.0016 | 0.013 | 0.16 | 0.001 | 0.05 |

Note: MSE errors (and their variances) are obtained via 28,000 random samples in 20 repetitions.

### 5.4 Data-Driven Unit Cell Optimization.

Finally, we illustrate the benefits of the GAGP model in an inverse optimization scheme to realize unit cells with target stiffness tensor components and compare the results with those designed using other techniques. Establishing such an inverse link is highly desirable in structure design as it allows to efficiently achieve target elastic properties by avoiding expensive finite element simulations and tedious trial and error in TO. In addition, although not demonstrated in this work, such a link can provide multiple candidate unit cells with the same properties that, in turn, enable tiling different unit cells into a macrostructure while ensuring boundary compatibility.

*E*^{t}and

*E*^{p}are the vectorized forms of, respectively, the target and predicted stiffness tensors.

**= [**

*λ**λ*

_{1}, …,

*λ*

_{16}] and

*λ*^{0}are the LB spectra of, respectively, candidate unit cell for the target stiffness tensor and the unit cell closest to the prescribed properties in the property space (

*λ*

_{i}is the

*i*th order eigenvalue). We choose GA for optimization since the GAGP model is cheap to run and GA ensures global optimality for multivariate and constrained problems. Note that, to ensure fast convergence during GA, we limit the search space for GA using the LB spectrum of the unit cell in the training dataset whose properties are closest to

*E*^{t}.

After obtaining the optimal LB spectrum, we use a level set method to reconstruct the corresponding unit cell [64] while employing squared residuals of the LB spectrum as the objective function. For faster convergence, the unit cell closest to the optimal LB spectrum in the spectrum space is taken as the initial guess in the reconstruction process.

In the following two examples, the goal is to design structures with desired *E*_{x}, *E*_{y}, and *E*_{xy} (see the target properties in Fig. 10). For each example, three unit cells are designed using different models: GAGP, NN, and GP (GAGP and NN use the entire dataset while GP uses the initial one with 358 structures). The results are visualized in Fig. 10 and demonstrate that the unit cells identified from GAGP and NN are more geometrically diverse than those obtained via GP. This is a direct result of populating the large dataset with perturbed structures and, in turn, providing the GA search process with a wider range of initial seeds. While we utilized our entire database for GAGP and NN in an attempt to provide more diversity for new designs, a smaller or less diverse training dataset could potentially achieve similar results; such a study is left for future work. We also note that the unit cells designed with GP are similar in shape but different in the size of the center hole, which leads to the significant change in properties.

From a quantitative point of view, our data-driven design method with the large database can, compared with the small dataset case, discover unit cells with properties that are closer to the target values. For instance, in Ex1, the GAGP and NN results using the large dataset achieve the target *E*_{x}, whereas the GP result with the small dataset differs from the target by around $12%$. Ex2 shows a similar pattern, with the GAGP, NN, and GP results differing from the target *E*_{x} by $4%$, $10%$, and $16%$, respectively. When the small dataset is used, the greater deviations from the target properties can be mainly attributed to insufficient training samples and the relatively small search space. This reinforces the need for a large database of unit cells in the data-driven design of metamaterials, along with an expedient machine learning method for big data. Moreover, unit cells designed with GAGP have smaller deviations than those with NN.

## 6 Conclusion and Future Works

In this work, we proposed a novel approach, GAGP, to enable GP modeling of massive datasets. Our method is centered on the observation that the hyperparameter estimates of a GP converge to some limit values, $\omega ^\u221e$, as more training samples are added. We introduced an intuitive and straightforward method to find $\omega ^\u221e$ and, subsequently, build a collection of independent GPs that all use the converged $\omega ^\u221e$ as their hyperparameters. These GPs randomly distribute the entire training dataset among themselves, which allows inference based on the entire dataset by pooling the predictions from the individual GPs. The training cost of GAGP primarily depends on the initial optimization with *n*_{s} data points and the *s* optimizations thereafter. The former cost is the same as fitting a conventional GP with *n*_{s} samples. The latter is an additional cost but is generally manageable since a single initial guess close to the global optimum is available at each iteration. The cost of building *m* GPs is negligible compared with these optimization costs. The prediction cost, although being *m* times larger than a conventional GP, is small enough for practical applications.

With analytical examples, we demonstrated that GAGP achieves very high predictive power that matches, and in some cases exceeds, that of state-of-the-art machine learning methods such as NNs and boosted trees. Unlike these latter methods, GAGP is easy to fit and interpret, which makes it particularly useful in engineering design with big data. Although the predictive power of GAGP increases as the size of the training data increases, so does the cost of fitting and training; it may be necessary to choose part of the data if resources are limited. We also note that, throughout, we assumed that the training samples are not ordered or highly correlated. If they are, randomization and appropriate transformations are required. In addition, we assumed stationary noise with an unknown variance. Considering nonstationary noise variance would be an interesting and useful extension for GAGP. Thrifty sample selection for model refinement (instead of randomly taking subsets of training data) can also improve the predictive power of GAGP and is planned for our future works.

As a case study, we applied GAGP to a data-driven metamaterials unit cell design process that attains desired elastic properties by transforming the complex material design problem into a parametric one using spectral descriptors. After mapping reduced-dimensional geometric descriptors (LB spectrum) to properties through GAGP, unit cells with properties close to the target values are discovered by finding the optimal LB spectrum with inverse optimization. This framework provides a springboard for a salient new approach to systematically and efficiently design metamaterials with optimized boundary compatibility, spatially varying properties, and multiple functionalities.

## Acknowledgment

The authors are grateful to Professor K. Svanberg from the Royal Institute of Technology, Sweden, for providing a copy of the MMA code for metamaterial design. Support from the National Science Foundation (NSF) (Grant Nos. ACI 1640840 and OAC 1835782; Funder ID: 10.13039/501100008982) and the Air Force Office of Scientific Research (AFOSR FA9550-18-1-0381; Funder ID: 10.13039/100000181) are greatly appreciated. Ms. Yu-Chin Chan would like to acknowledge the NSF Graduate Research Fellowship Program (Grant No. DGE-1842165).

## Nomenclature

*d*=input dimensionality

*n*=number of training samples

*q*=output dimensionality

*s*=number of times that

*n*_{s}samples are added to*n*_{0}=*x*vector of

*d*inputs=*y*vector of

*q*outputs*L*=objective function in MLE

=*R*sample correlation matrix of size

*n*×*n*- GP =
Gaussian process

- MLE =
maximum likelihood estimation

*δ*=Nugget or jitter parameter

*n*_{0}=number of initial random samples

*n*_{s}=number of random samples added to

*n*_{0}per iteration=*ω*roughness parameters of the correlation function

- $\omega ^\u221e$ =
estimate of

via MLE with very large training data*ω*

## References

*Information Theory and Statistics*,