The representation of high-dimensional potential energy surfaces by way of the many-body expansion and permutationally invariant polynomials (PIPs) has become a well-established tool for improving the resolution and extending the scope of molecular simulations. The high level of accuracy that can be attained by these potential energy functions (PEFs) is due in large part to their specificity: for each term in the many-body expansion, a species-specific training set must be generated at the desired level of theory and a number of fits attempted in order to obtain a robust and reliable PEF. In this work, we characterize the numerical aspects of the fitting problem, addressing questions of simultaneous practical and fundamental importance. These include concrete illustrations of the nonconvexity of the problem, the ill-conditioning of the linear system to be solved and the possible need for regularization, the sensitivity of the solutions to the characteristics of the training set, and limitations of the approach with respect to accuracy and the types of molecules that can be treated. In addition, we introduce a general approach to the generation of training set configurations based on the familiar harmonic approximation and evaluate the possible benefits of using quasirandom sequences to sample configuration space in this context. While sulfate serves as our case study, the findings are largely generalizable and are expected to ultimately facilitate the efficient development of PIP-based many-body PEFs for general systems via automation.

## I. INTRODUCTION

The use of molecular mechanics force fields or more elaborate representations of the Born–Oppenheimer potential energy surface (PES) is a virtually unavoidable practice in the domain of molecular simulation. The ongoing development of such force fields and surfaces spans many decades, with the resulting approximations falling, as usual, along a spectrum of accuracy vs computational cost. Given the astounding range of spatial and temporal scales over which fundamental chemical and biochemical processes occur, it is clear that the full spectrum of force fields and PESs, including coarse-grained varieties, will continue to play a vital role in the broad arena of molecular simulations.

More recent advances in PES development rely heavily on the fitting and interpolation of large quantities of *ab initio* data and may eschew any physically intuited analytical form of the PES altogether. These include permutationally invariant polynomials (PIPs),^{1,2} neural networks,^{3–8} and kernel-based approaches,^{9–11} which typically strive for the accuracy of high-level electronic structure calculations [ideally CCSD(T) in the complete basis set limit] while being orders of magnitude less computationally intensive, depending on the details of the simulation and complexity of the model PES. In addition to various measures of accuracy, considerations concerning implementation, amount of data required for training, transferability, and computational cost may be important deciding factors when weighing their relative advantages and disadvantages. Interestingly, head-to-head comparisons, namely, Ref. 12, comparing Behler–Parrinello neural networks, Gaussian approximation potentials, and PIP-based potentials for the representation of two- and three-body water interactions, and Ref. 13, comparing the use of Gaussian processes and PIPs for describing hydronium, formaldehyde, and the formic acid dimer, have found the different approaches to be quite competitive with one another. At the same time, Ref. 14, comparing neural network and Gaussian process regression for representing a formaldehyde PES, found the Gaussian process PES to yield more accurate vibrational spectra and also contrasted the two approaches with respect to a number of practical considerations, including cost of evaluation and ease of fitting.

While the focus of this work is the PIP-based representation of single molecules, we consider it in the broader context of PESs based on the well-known many-body expansion, in which the interaction energy of a system of *N* molecules is expressed as a sum of its nested *n*-body contributions, *n* = 1, …, *N*. The many-body expansion is known to converge rapidly for insulators, and in particular for small water clusters,^{15–19} the larger (H_{2}O)_{21} cluster,^{20} and small OH^{−}(H_{2}O)_{n}^{21} and H_{3}O^{+}(H_{2}O)_{n}^{22} clusters, but converges very poorly for metallic systems or systems governed by strong long-range interactions.^{23} As such, PESs taking on this form, such as the CC-pol,^{24,25} WHBB,^{26,27} and MB-pol^{28–30} potentials for water, and the MB-nrg^{31,32} ion–water potentials, are able to perform very well while including only explicit one-, two-, and sometimes three-body terms. The accuracy of the MB-pol and MB-nrg potentials, in particular, has been demonstrated with respect to a variety of high-level *ab initio* and experimental data, and their moderate computational cost has enabled the study of large and complex systems in full detail, such as bulk aqueous systems, without resorting to less accurate models (e.g., rigid water models). Numerous examples include those found in Ref. 33 for the three phases of water and Ref. 34 for the hydration structure of aqueous Cs^{+}. The demonstrated accuracy of these potential energy functions (PEFs) enables more reliable theoretical predictions in areas where the corresponding experimental protocol has not yet been adequately developed, such as the temperature evolution of the vibrational spectra of aqueous clusters.^{35,36} In the MB-pol and MB-nrg models, each explicit *n*-body contribution comprises PIPs for the description of short-range, quantum mechanical interactions and classical descriptions of dispersion, electrostatics, and induction.
Higher-order contributions are incorporated via a many-body induction term.

A key aspect of all PIP-based PEFs is the departure from the tenet of transferability, which is abandoned in exchange for a very high degree of accuracy in modeling particular systems of special interest. As such, the generation of many-body PIP-based PEFs is currently a labor-intensive process, as it is necessary to generate a PIP fit for each explicitly included term of a many-body expansion. The investment of time and resources necessary to develop specialized PESs can be easily justified when considering the vast number of problems that can be addressed using only a small number of explicit *n*-body terms, for example, ion hydration. These include a multitude of questions requiring accuracy well beyond the frequently invoked notion of (thermo)chemical accuracy, ∼1 kcal/mol. A much more appropriate target for achieving agreement with increasingly resolved experimental data, and especially for contributing to the interpretation and deciphering of such data, would be that of spectroscopic accuracy, ∼1 cm^{−1}, which appears to be within reach.

Still, widespread adoption of many-body PIP PESs by the computational chemistry community will require an automated approach to their generation, the implementation of which will be dependent upon the extent to which the fitting procedure can be generalized. The problem of fitting PESs is a multifaceted one with many confounding factors. Any black box implementation of this procedure will require critical decisions to be made concerning the complexity of the polynomial to be used, how to generate a training set, proper choice of regularization parameter, and the number of fits to be attempted with little to no input from the user for an arbitrary chemical system. To understand the extent to which the procedure may be generalizable, a number of fundamental aspects of the fitting problem must be investigated, such as the nature of the (generally nonconvex) loss surface, the sensitivity of the solution(s) to various characteristics of the training set, and whether there are any limitations concerning the size and complexity of the chemical systems to be considered, and the level of accuracy that can be achieved. Also among the obstacles to automation lies a key difficulty in fitting PESs of any kind, which is simply attempting to define what constitutes the ideal training set, as this should depend on (1) the intended applications and scientific questions to be addressed and (2) the flexibility of the PEF, which determines any potential trade-offs in accuracy when favoring certain regions of configuration space over others in the process of training/fitting. Both considerations lack any objective, quantitative measure prior to the actual production of multiple fits and their subsequent evaluation and use.

Frequently encountered approaches to the generation of training set configurations include simulation-based approaches, namely, extracting configurations from molecular dynamics (MD) simulations (e.g., Refs. 37 and 28), including their biased counterparts, such as metadynamics (e.g., Ref. 38), and grid-based approaches in Cartesian or spherical coordinates (e.g., Refs. 39 and 40). Of course, a large number of other approaches, including many variations and combinations, can be found in the literature. Additional examples include the approaches used in Ref. 41, one of which relies on error estimation to determine a growing collection of training set configurations while simultaneously improving on the fit, effectively generating an interpolating moving least-squares PES on-the-fly, and that of Ref. 4, which focuses on selecting configurations which are most relevant to computing low-lying vibrational levels of a molecule.

In this work, we provide a comprehensive and detailed overview of the fitting procedure used in the generation of the PIP-based PEFs and explore the methodological and practical considerations described above. In addition, we present a general and intuitive procedure for the efficient generation of training set configurations based on sampling about minima and other stationary points of the PES using the harmonic approximation. This approach stands apart from other normal mode sampling approaches found in the literature (e.g., that of Ref. 42) in at least two important ways: First, a physically relevant distribution is sampled, namely, the (quantum or classical) canonical distribution in the harmonic approximation. (Or at least, physically relevant to the extent that the harmonic approximation is valid. For “floppier” molecules or molecular clusters, this sampling approach would likely need to be used in conjunction with other approaches, such as those in which configurations are extracted from a molecular dynamics or metadynamics trajectory.) Second, all modes are sampled simultaneously and without the use of grids, which scale very poorly with the dimension of configuration space to be sampled, particularly direct-product grids. Although nearly all data presented are for the particular case of sulfate, many of the observations and conclusions drawn can be reasonably extended to a generic molecule.

## II. METHODS

### A. MB-nrg functional form and permutationally invariant polynomials

The functional form of the MB-nrg (many-body energy) family of PEFs is based on the well-known and heavily exploited many-body expansion,^{43} whereby the interaction energy of a system of *N* molecules is expressed as the sum of all *n*-body contributions (*n* = 1, …, *N*),
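$$E(\mathbf{r}_1,\ldots,\mathbf{r}_N)=\sum_{i=1}^{N}\varepsilon^{(1)}(\mathbf{r}_i)+\sum_{i<j}\varepsilon^{(2)}(\mathbf{r}_i,\mathbf{r}_j)+\sum_{i<j<k}\varepsilon^{(3)}(\mathbf{r}_i,\mathbf{r}_j,\mathbf{r}_k)+\cdots+\varepsilon^{(N)}(\mathbf{r}_1,\ldots,\mathbf{r}_N),$$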

where r_{i} are the atomic coordinates of the *i*th molecule and the *n*-body contributions are defined recursively by
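$$\varepsilon^{(n)}(\mathbf{r}_1,\ldots,\mathbf{r}_n)=E(\mathbf{r}_1,\ldots,\mathbf{r}_n)-\sum_{m=1}^{n-1}\sum_{i_1<\cdots<i_m\leq n}\varepsilon^{(m)}(\mathbf{r}_{i_1},\ldots,\mathbf{r}_{i_m}).$$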

The rapid convergence of the many-body expansion for aqueous systems makes it possible to obtain close agreement with a large variety of structural, thermodynamic, and dynamical reference data using only up to explicit three-body contributions, particularly when higher-order contributions are incorporated via a many-body induction term, as is done in the MB-pol^{28–30} and MB-nrg^{31,32} PEFs. Moreover, this expansion makes possible the systematic refinement of the model PEF by the explicit inclusion of additional higher-order terms, though these are increasingly challenging to obtain. The inclusion of each *n*-body term requires the generation of a representative training set of configurations, followed by a nonconvex optimization which minimizes the (root mean square) deviation between the energies given by the model and the reference energies. A similarly daunting feature of this approach is the potential combinatorial explosion that can arise when one wishes to consider mixtures of species.

The accurate representation of complex *ab initio* PESs in the MB-nrg PEF family is made possible by the use of PIPs,^{1,2} whose adoption for the purpose of representing molecular PESs was first reported by Bowman and coworkers in 2003 for the case of CH_{5}^{+}.^{37} It is well known that a molecular PES must be invariant to translations and rotations of the molecule, as well as permutations of like atoms (atoms having the same atomic number).^{44,45} One may consider these invariances as ideally arising naturally from the choice of coordinates for the PES.^{1,44} In particular, the set of all interatomic distances clearly satisfies translational and rotational invariance, and while it does not satisfy invariance with respect to permutations of like atoms, it possesses important algebraic attributes that allow permutationally invariant basis functions for the representation of the PES to be constructed.^{1,46} Rather than working directly with interatomic distances as coordinates, which would lead the representation of the PES to diverge at large interatomic separations, transformed variables *y*_{ij} = *y*(*r*_{ij}) of the interatomic distances *r*_{ij} are introduced, which vanish in the limit of large *r*_{ij}. For the PEFs presented in this work, we employ Morse-type variables^{1} of the form
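$$y_{ij}=e^{-k_{ij}r_{ij}},$$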

where the nonlinear parameters *k*_{ij} are to be fit. Note that the same *k*_{ij} is used for “chemically equivalent” pairs of atoms, e.g., a single *k*_{SO} parameter is used for all four sulfur–oxygen distances in the PEF for sulfate, as well as a single *k*_{OO} parameter for all oxygen–oxygen distances.

For a molecule or cluster containing *n* atoms, the potential is expressed as a sum of all power products of the associated *n*(*n* − 1)/2 Morse-type variables *y*_{ij} such that the sum of the exponents does not exceed some specified integer value *k*, i.e., the sum of all *monomials* of degree *k* or less,
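$$V(\mathbf{y})=\sum_{|\mathbf{a}|\leq k}c_{\mathbf{a}}\prod_{i<j}y_{ij}^{a_{ij}},\qquad |\mathbf{a}|=\sum_{i<j}a_{ij},$$

where the sum runs over all vectors of nonnegative integer exponents $\mathbf{a}$.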

This *k* will subsequently be referred to as the order of the polynomial. Note that while the zeroth-order term, a constant, is formally included here, it is identically zero in our one-body fits, which are carried out only for molecules in their electronic ground state, and will henceforth be neglected. Introducing the symmetrization operator $S$, which symmetrizes the monomials as described in Refs. 1 and 47, yields
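$$V(\mathbf{y})=\sum_{\eta}\tilde{c}_{\eta}\,\mathcal{S}\left[\prod_{i<j}y_{ij}^{a_{ij}(\eta)}\right],$$

with one linear parameter $\tilde{c}_{\eta}$ per symmetrized monomial.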

Although the objective of the symmetrization procedure is to build the permutational invariance into the fitting basis, it has the additional benefit of considerably reducing the number of terms in Eq. (4), often dramatically, depending on the particular molecule/system being described. As a consequence, a *k*th-order polynomial for a relatively large molecule possessing many permutable atoms may contain fewer terms than that of a smaller molecule where few if any atoms are considered equivalent. (This point will be revisited with concrete examples in Sec. III.) Note that for molecules consisting of five or more atoms, the number of interatomic distances, *n*(*n* − 1)/2, exceeds the number of internal degrees of freedom, 3*n* − 6, so that this choice of coordinates is, in a sense, redundant. An implication of this form of the PEF (a sum of highly correlated monomial terms) and this choice of coordinates (which inflates the inherent dimensionality of the problem for systems of five or more atoms) is that some degree of structural and data-based multicollinearity is present, which makes possible large variations in the linear parameters, i.e., the various monomial terms may compensate for one another. Were it possible to define a suitable linearly independent fitting basis, this variability in the solution would not exist.
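Since a monomial of total degree at most *k* in *m* variables corresponds to a choice of nonnegative exponents summing to at most *k*, the unsymmetrized term counts follow from a standard stars-and-bars argument, C(*m* + *k*, *k*) − 1 after dropping the constant term. The short sketch below illustrates this count (the ethane figures quoted in Sec. III serve as a check; the function names are ours, for illustration only):

```python
# Count the unsymmetrized monomials of degree 1..k in the n(n-1)/2
# Morse-type variables of an n-atom molecule (stars and bars:
# C(m + k, k) minus 1 for the dropped constant term).  Symmetrization
# (not shown) reduces these counts further for permutable atoms.
from math import comb

def n_distances(n_atoms: int) -> int:
    """Number of interatomic distances, n(n-1)/2."""
    return n_atoms * (n_atoms - 1) // 2

def n_monomials(n_atoms: int, k: int) -> int:
    """Monomials of degree 1 through k in the n(n-1)/2 variables."""
    m = n_distances(n_atoms)
    return comb(m + k, k) - 1

# Ethane (8 atoms, 28 distances): orders 1-4 give 28, 434, 4494, 35959.
print([n_monomials(8, k) for k in range(1, 5)])
# Sulfate (5 atoms, 10 distances) at order 4:
print(n_monomials(5, 4))
```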

The one-body fits presented in this work are based entirely on PIPs. For larger molecules, classical descriptions of electrostatics, induction, and dispersion can be included for nonbonded atoms, e.g., 1–4 interactions or greater. Previously reported MB-nrg potentials—the MB-pol potential for water^{28–30} and alkali-metal-ion–water^{32} and halide–water potentials^{31}—have made use of PIPs at the two- and three-body level, employing the Partridge–Schwenke potential^{40} for the representation of individual water molecules. For these terms, the PIPs provide a short-range correction to classical descriptions of the electrostatic, dispersion, and induction energies, effectively accounting for quantum mechanical effects such as charge transfer, charge penetration, and Pauli repulsion. The ability of the PIPs to quantitatively capture these quantum mechanical effects is demonstrated in Ref. 48 for MB-pol.

### B. Fitting: Simplex minimization, linear regression, and Tikhonov regularization

Solving for the nonlinear parameters *k*_{ij} and linear parameters $\tilde{C}$ of Eqs. (3) and (5) amounts to a nontrivial, nonconvex optimization problem, i.e., a global optimization problem in which the “best” of many possible solutions must be identified, not unlike the problem of locating the global minimum of a complex potential energy landscape.^{49} In the context of fitting PEFs, we wish to minimize the cost function corresponding to the mean square deviation between the fit energies and the reference *ab initio* energies, $\|E_{\text{fit}}-E_{\text{ref}}\|_2^2$, where ∥·∥_{2} denotes the Euclidean norm. To this end, we apply an iterative approach which combines Nelder–Mead simplex optimization^{50} and linear least squares regression, as has been done in the previously developed MB-pol and MB-nrg PEFs. At the first iteration, an initial “guess” for the nonlinear parameters is made, followed by a round of linear regression to solve for the linear parameters, which depend parametrically upon the nonlinear parameters. A single simplex optimization step for the nonlinear parameters is then performed, followed by successive iterations of linear regression and simplex optimization steps until the desired convergence criteria for the nonlinear parameters are reached. Because the problem is nonconvex, the full parameter space should be explored; thus, several fits must be attempted using different initial guesses for the nonlinear parameters. In practice, the *k*_{ij} values of the solutions have consistently been found to tend toward moderate values (e.g., typically in the range of 0–3 Å^{−1} for sulfate). In addition, to help avoid overfitting and ensure that the resulting fits are stable, Tikhonov regularization may be incorporated in the linear regression. Generally speaking, such regularization is used when attempting to solve ill-conditioned and/or ill-posed linear problems by imposing a degree of smoothness in the solution.
This typically comes at the expense of a modest decrease in the local accuracy of the fit in at least some regions, but at the benefit of yielding a fit which is more stable and better behaved overall.
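The nested structure of this procedure can be sketched on a synthetic one-dimensional toy problem: an outer Nelder–Mead search over a single nonlinear Morse parameter, with an inner linear least-squares solve at every step. All data, parameter values, and names below are illustrative stand-ins, not the actual MB-nrg fitting code:

```python
# Minimal sketch of the nested fitting loop on a synthetic 1D "PES":
# an outer Nelder-Mead search over the nonlinear Morse parameter k,
# with an inner linear least-squares solve for the polynomial
# coefficients at every simplex step.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
r = rng.uniform(0.8, 3.0, size=200)  # toy "training set" distances
E_ref = np.exp(-3.0 * (r - 1.2)) - 2.0 * np.exp(-1.5 * (r - 1.2))  # toy energies

def cost(params, degree=4):
    """Residual of the best linear fit for a given nonlinear parameter k."""
    k = params[0]
    y = np.exp(-k * r)                                # Morse-type variable
    A = np.column_stack([y**p for p in range(1, degree + 1)])
    c, *_ = np.linalg.lstsq(A, E_ref, rcond=None)     # inner linear solve
    return np.sum((A @ c - E_ref) ** 2)

# Several fits from different initial guesses, keeping the best one.
fits = [minimize(cost, x0=[k0], method="Nelder-Mead") for k0 in (0.5, 1.0, 2.0)]
best = min(fits, key=lambda res: res.fun)
print(best.x[0], best.fun)
```

Even this toy problem has more than one equally good nonlinear solution (*k* ≈ 1.5 and *k* ≈ 0.75, since $y_{k/2}^{2} = y_{k}$), a small-scale echo of why multiple fits from different initial guesses are necessary.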

The procedure is then as follows: For a given collection of nonlinear parameters, we wish to obtain the collection of linear parameters **x** which minimizes the familiar expression,
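$$L(\mathbf{x})=\|A\mathbf{x}-\mathbf{b}\|_{2}^{2},$$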

where each row of $A \in \mathbb{R}^{m\times n}$ corresponds to a training set configuration (expressed with respect to the PIP fitting basis in the specified Morse-type variables), $\mathbf{x} \in \mathbb{R}^{n}$ corresponds to the coefficients/linear parameters of the PIP fitting basis, and $\mathbf{b} \in \mathbb{R}^{m}$ corresponds to the reference/*ab initio* energies for the training set configurations. This is, of course, simply an equivalent expression for our cost function, $\|E_{\text{fit}}-E_{\text{ref}}\|_2^2$. In fitting a PES for a real molecular system, this can surely be expected to be an inconsistent, overdetermined system of equations, i.e., we can expect that no exact solution exists, and therefore the least squares minimization returns an approximate solution,
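$$\hat{\mathbf{x}}=\left(A^{T}A\right)^{-1}A^{T}\mathbf{b},$$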

provided that $(A^{T}A)^{-1}$ exists.

The need for regularization can be appreciated when considering the pseudoinverse $A^{+}$ of the ill-conditioned matrix $A$. Expressing $A$ by its singular value decomposition, $A = U\Sigma V^{T}$, where $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal and $\Sigma = \mathrm{diag}(\sigma_i) \in \mathbb{R}^{m\times n}$ is the diagonal matrix of singular values $\sigma_i$, we find that computing the pseudoinverse $A^{+} = V\,\mathrm{diag}(\sigma_i^{-1})\,U^{T}$ is numerically unstable and sensitive to small perturbations in $A$. In particular, the relatively large amplification of “noise” linked to the smaller singular values effectively undermines any algorithm which relies on the evaluation of $A^{+}$, regardless of the particular approach or decomposition used. Tikhonov regularization mitigates the impact of these small singular values for ill-conditioned matrices, by taking the singular values of the regularized pseudoinverse matrix $A_{\alpha}^{+}$ to be
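$$\sigma_{i}\;\longrightarrow\;\frac{\sigma_{i}}{\sigma_{i}^{2}+\alpha^{2}}$$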

for some suitable choice of regularization parameter *α* ∈ [0, ∞), thereby limiting the influence of the smaller singular values with respect to the larger singular values and reducing the effective condition number. With this substitution, the normal equation to be solved is now
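$$\left(A^{T}A+\Gamma^{T}\Gamma\right)\mathbf{x}=A^{T}\mathbf{b},$$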

where Γ = *αI*_{n}. Equivalently, this can be seen as introducing a regularization term $\|\Gamma\mathbf{x}\|_2^2$ to our cost function, *L*, which penalizes solutions having larger norms,
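$$L(\mathbf{x})=\|A\mathbf{x}-\mathbf{b}\|_{2}^{2}+\|\Gamma\mathbf{x}\|_{2}^{2}.$$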

When *α* = 0, no regularization takes place. As *α* approaches infinity, all information contained in the system is obliterated and the trivial solution **x** = **0** is obtained. Note that Eq. (9) is convex and so provides a unique minimum for each set of nonlinear parameters and choice of regularization parameter *α*. The choice of *α* is a nontrivial one. Problems which are ideally suited to regularization exhibit a distribution of singular values having a clear gap between larger and smaller singular values. In this case, the value of the regularization parameter is best chosen to be within this gap. For the PEF fitting problem addressed here, the distribution of singular values is instead characterized by a very smooth progression from the largest to smallest singular value. In this case, the choice of regularization parameter must be made with additional care. In practice, this is often done simply by varying the value of *α* and evaluating the resulting fits, e.g., via cross-validation, or by determining the relative contribution of the regularization term to the cost function.
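The effect of these filter factors can be demonstrated on a synthetic ill-conditioned system; the random matrix, smoothly decaying spectrum, and noise level below are arbitrary stand-ins for a real design matrix, chosen only to mimic the smooth singular-value progression described above:

```python
# Tikhonov regularization via the SVD: the filtered inverse uses factors
# s_i / (s_i**2 + alpha**2), damping the contribution of small singular
# values.  Increasing alpha shrinks the solution norm at the cost of a
# larger residual.  All data here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((200, 50)))
V, _ = np.linalg.qr(rng.standard_normal((50, 50)))
s = np.logspace(0, -8, 50)                   # smoothly decaying spectrum
A = U @ np.diag(s) @ V.T                     # ill-conditioned "design matrix"
b = A @ rng.standard_normal(50) + 1e-6 * rng.standard_normal(200)  # noisy data

def tikhonov_solve(A, b, alpha):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    filt = s / (s**2 + alpha**2)             # regularized inverse singular values
    return Vt.T @ (filt * (U.T @ b))

for alpha in (0.0, 1e-6, 1e-3):
    x = tikhonov_solve(A, b, alpha)
    print(f"alpha={alpha:g}  |x|={np.linalg.norm(x):.3e}  "
          f"residual={np.linalg.norm(A @ x - b):.3e}")
```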

### C. Training set generation

In generating representative training sets for our PEFs, we adopt a conceptually simple yet in practice extraordinarily challenging goal: the complete exploration of all “physically relevant” regions of the high-dimensional, *ab initio* PES of the system, as well as the adequate sampling of “unphysical” regions of configuration space sufficient to ensure the stability of the fit and the absence of “holes” (regions where the PEF must extrapolate from the training set and extreme values of the energy may be wrongly predicted). In particular, we wish to sample about the physically relevant minima, transition states, and saddles of the PES, as these regions are typically the most important in inferring emergent properties of the system. Indeed, numerous methods have been successfully applied to the calculation of thermodynamic properties using only information from the minima of the PES, and even the exploration of complex dynamics of reactive systems using only information from the minima and transition states of the PES (and their connectivity) combined with transition state theory, such as kinetic Monte Carlo and discrete path sampling.^{51} When sampling the minima and transition states of the *ab initio* PES using conventional Monte Carlo (MC) or MD approaches is too costly, as is the case for any system for which we may be interested in developing a PEF, a natural alternative is to make use of the harmonic approximation. Sampling the harmonic distribution avoids the closely correlated, successive configurations of a Markov chain or trajectory, which would require the generation of many more configurations than needed, followed by pruning, to obtain an effective training set. Beyond this cost-effectiveness, it affords the user a degree of flexibility and control over the generation of the training set that cannot be achieved with MC or MD, such as the ability to manipulate the sequence underlying the sampling.

For this approach, we begin with the canonical partition function for a general, harmonic *N*-particle quantum system,

where

is the Hamiltonian of the harmonic system at temperature *T*, **D** is the displacement-displacement correlation matrix (also the distribution covariance matrix), *K* is the Hessian, *q* is the center of the Gaussian, and *V*_{0} is the minimum energy of the potential. The covariance matrix of this distribution is given by

where **M** = diag(*m*_{i}) is the mass matrix, $\Omega $ = diag(*ω*_{i}) is the frequency matrix, and the auxiliary function *d* is defined as
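$$d(\omega;T)=\frac{\hbar}{2\omega}\coth\left(\frac{\hbar\omega}{2k_{B}T}\right).$$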

Note that *d*(*ω*_{i}; *T*) effectively describes the breadth of the distribution at temperature *T* in the dimension of the *i*th normal mode due to the quantum and thermal fluctuations of the system within the harmonic approximation. (This is in contrast to the auxiliary function for the analogous classical system, $d_{\text{classical}}(\omega_i; T) = k_B T/\omega_i^2$, which gives rise to temperature-independent ratios *d*_{i}/*d*_{j}.)

Following Ref. 52, we sample this distribution using inverse transform sampling, in which an initial sequence of points uniform on [0, 1)^{3N} is transformed to sample a normal distribution $\mathcal{N}(\mathbf{q},\mathbf{D})$ having mean **q** and covariance matrix **D**. For each reference configuration, i.e., for each local minimum or stationary point of interest belonging to the *ab initio* PES, a transformation matrix is constructed using the normal modes obtained from the mass-scaled Hessian, which can be readily obtained from any standard electronic structure package. The initial $U[0,1)^{3N}$ points may be obtained from either a pseudorandom or quasirandom sequence and after transformation to $\mathcal{N}(\mathbf{q},\mathbf{D})$ represent the displacements in Cartesian coordinates from the reference configuration. Quasirandom, or low-discrepancy, sequences are deterministic sequences designed by number theorists for the express purpose of efficient numerical integration in high dimension. By construction, these sequences possess a local uniformity which makes them superior to pseudorandom sequences in the context of numerical integration, and convergence on the order of ∼1/*N*_{MC} can often be achieved in practice, to be compared with the $1/\sqrt{N_{\text{MC}}}$ convergence of standard Monte Carlo integration for a sequence of *N*_{MC} points. As we are concerned with efficient sampling of relatively high-dimensional configuration spaces, we have investigated the use of both pseudorandom and quasirandom sequences in the generation of our one-body training sets.
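A minimal sketch of this sampling pipeline, using SciPy's Sobol generator for the quasirandom case, might look as follows; the frequencies, normal modes, and unit system below are illustrative placeholders, not data for any real molecule:

```python
# Hedged sketch of inverse transform sampling of the harmonic
# distribution: Sobol points on [0,1)^d are pushed through the inverse
# normal CDF, scaled by the per-mode breadths d(w; T), and rotated by a
# placeholder normal-mode matrix to yield draws from N(q, D).
import numpy as np
from scipy.stats import norm, qmc

d = 6                                    # number of sampled normal modes
omega = np.linspace(500.0, 1500.0, d)    # placeholder harmonic frequencies
rng = np.random.default_rng(2)
modes = np.linalg.qr(rng.standard_normal((d, d)))[0]  # placeholder mode matrix
q = np.zeros(d)                          # reference (minimum) geometry

hbar, kB, T = 1.0, 1.0, 300.0            # illustrative unit system
# Quantum harmonic breadth per mode: (hbar/2w) coth(hbar*w / 2*kB*T)
var = (hbar / (2.0 * omega)) / np.tanh(hbar * omega / (2.0 * kB * T))

sobol = qmc.Sobol(d=d, scramble=True, seed=0)
u = np.clip(sobol.random(n=1024), 1e-12, 1.0 - 1e-12)  # uniform on (0,1)^d
z = norm.ppf(u)                          # independent standard normals
# Scale each mode by its breadth and rotate back: the columns of
# `configs` are draws from N(q, D) with D = modes diag(var) modes^T.
configs = q[:, None] + (modes * np.sqrt(var)) @ z.T
print(configs.shape)
```

Swapping `qmc.Sobol` for `rng.random((1024, d))` gives the pseudorandom variant of the same scheme.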

It is important to note that the use of uniform direct-product grids in place of uniform random sequences is completely inadequate once the dimensionality of the configuration space becomes too large, which is already the case for rather modest systems containing only a few atoms. To illustrate this point, we can consider the 15-dimensional configuration space of sulfate. A fourth-order PIP representation of the PES requires fitting 82 linear parameters, for which ∼1000 configurations can be expected to be sufficient to avoid an overfitted PEF (based on the data shown in Fig. 9). The crudest possible uniform direct-product grid, consisting of only 2 values for each component/degree of freedom, would require 2^{9} = 512 configurations to sample all 9 vibrational degrees of freedom simultaneously, which may seem manageable; however, an incrementally finer regular grid consisting of 3 values for each of the 9 degrees of freedom already amounts to 3^{9} = 19 683 configurations. Clearly, the use of grids to sample all degrees of freedom simultaneously is quickly doomed to fail, even if attempting a compromise where some dimensions are sampled more densely than others. Alternatively, one could settle for systematically sampling subspaces of lower dimensionality independently, which is likewise inefficient. Moreover, there is a high degree of redundancy in sampling with uniform grids, in the sense that each component sampled takes on only a few distinct values from the range of possible values it can assume. By contrast, using random sequences, one can readily sample all degrees of freedom simultaneously using any number of configurations, easily append or prune existing training sets (though this aspect is nontrivial when quasirandom sequences are being employed), and also allow each component to take on a great variety of values.

In this sampling approach, which we will henceforth refer to loosely as sampling the “harmonic distribution” [Eq. (10)], the only free parameter to be chosen is the temperature. A “good” training set leading to an accurate and stable fit must contain highly distorted, high-energy structures found at higher temperatures, in addition to the lower-energy structures which are expected to be much more frequently encountered in typical MD or MC simulations. Our results suggest that a suitable compromise is obtained when sampling the harmonic distribution over a range of temperatures, and we have settled on a somewhat heuristic approach of using either a linear (*T*_{i+1} − *T*_{i} = constant) or geometric (*T*_{i+1}/*T*_{i} = constant) temperature progression when sampling the harmonic distribution. Supposing that the functional form of the PEF is sufficiently complex to represent high-dimensional PESs (i.e., for PIPs of sufficiently high order), the ideal choice of temperatures over which to sample configurations would appear to be that which adequately “excites” all of the normal modes of the system/molecule simultaneously. This view is held in spite of the fact that the maximum temperature used will be much higher than what one would ever intend to use in a simulation, as these high-energy, highly distorted configurations obtained from sampling at very high temperature lend *stability* to the resulting polynomial fits. In addition, note that sampling using a classical distribution, either via the harmonic approximation or via classical MD or MC simulations, would result in an undersampling of the high-frequency modes with respect to the quantum distribution, which we employ. 
Sampling approaches which attempt to include only “physically relevant” configurations, i.e., almost exclusively low-energy configurations, increase the risk of holes in the resulting PEF, where the fit is forced to extrapolate in regions of configuration space that were effectively not represented in the training set, and the exponentials comprising the PEF return extreme values for the energy.

Additional complications may arise when applying the sampling scheme proposed here to systems that possess a relatively broad range of eigenfrequencies, i.e., to “floppy” systems. In such cases, attempting to sample at temperatures that are high enough to adequately “excite”/sample high-frequency modes causes the low-frequency modes to become “overexcited” and bonds to break. This is illustrated indirectly in Fig. 1, which shows the temperature dependence of *d*(*ω*_{i}; *T*) [Eq. (13)] for several different normal modes of SO_{4}^{2−} and N_{2}O_{5}. This can be overcome by invoking the well-known concept of the Einstein temperature from the study of crystal lattices, where each lattice vibration is associated with a characteristic temperature Θ_{E} = *ℏω*/*k*_{B}. By sampling each mode at a characteristic temperature directly related to its frequency, we are able to sample each degree of freedom broadly enough to result in a stable fit, yet without “breaking” the molecule or cluster. In this case, we sample the above normal distribution over a range of *A* ≔ *k*_{B}*T*_{i}/*ℏω*_{i} so that the breadth of sampling along all modes is more regular,
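$$d(\omega_{i};A)=\frac{\hbar}{2\omega_{i}}\coth\left(\frac{1}{2A}\right).$$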

Sampling in this manner has the effect of constraining the ratios *d*(*ω*_{i})/*d*(*ω*_{j}) and ultimately preventing the Cartesian displacements of the coordinates from becoming unsuitably large. While these configurations are perhaps less physically relevant, they are expected to result in more stable fits by exploring regions of configuration space that would otherwise be neglected, thereby avoiding the need for potentially wild extrapolation. An illustration of the transformation from $U[0,1)^{3N}$- to $N(q,D)$-distributed points is shown in Fig. 2 for the case of N_{2}O_{5} using both pseudorandom and quasirandom sequences, and both Eqs. (13) and (14).
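The pipeline described above can be sketched as follows. This is a minimal illustration, assuming Eq. (13) is the standard quantum harmonic-oscillator width, σ²(ω; *T*) = (ℏ/2ω) coth(ℏω/2*k*_{B}*T*), for mass-weighted normal-mode coordinates; the function names are illustrative, not those of any production code. A scrambled Sobol sequence on [0, 1)^{d} is mapped through the inverse normal CDF and scaled by the per-mode widths, with the fixed-*A* variant assigning each mode its own temperature *T*_{i} = *A*ℏω_{i}/*k*_{B}:

```python
import numpy as np
from scipy.stats import norm, qmc

HBAR = 1.054571817e-34  # J s
KB = 1.380649e-23       # J / K

def mode_width(omega, T):
    # Std. dev. of a quantum harmonic oscillator's mass-weighted coordinate,
    # sigma^2 = (hbar / 2 omega) * coth(hbar omega / 2 kB T).
    x = HBAR * omega / (2.0 * KB * T)
    return np.sqrt(HBAR / (2.0 * omega) / np.tanh(x))

def _sobol_normal(n, d, seed):
    # Sobol points on [0, 1)^d mapped through the inverse normal CDF,
    # i.e., the U[0,1)^d -> N(0,1)^d transformation illustrated in Fig. 2.
    u = qmc.Sobol(d=d, scramble=True, seed=seed).random(n)
    return norm.ppf(np.clip(u, 1e-12, 1.0 - 1e-12))

def sample_harmonic(omegas, T, n, seed=0):
    # Normal-mode displacements drawn from the harmonic distribution
    # at a single temperature T.
    sig = np.array([mode_width(w, T) for w in omegas])
    return _sobol_normal(n, len(omegas), seed) * sig

def sample_fixed_A(omegas, A, n, seed=0):
    # Fixed-A variant: mode i is sampled at its own Einstein-like
    # temperature T_i = A * hbar * omega_i / kB, which regularizes the
    # width ratios d(omega_i)/d(omega_j).
    sig = np.array([mode_width(w, A * HBAR * w / KB) for w in omegas])
    return _sobol_normal(n, len(omegas), seed) * sig
```

In the fixed-*A* variant, the width ratio between two modes reduces to (ω_{j}/ω_{i})^{1/2} independently of *A*, which is what keeps the breadth of sampling “regular” across modes.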

## III. RESULTS AND DISCUSSION

Using the methods for fitting and training set generation outlined in Sec. II, PIP-based one-body PEFs were fit to *ab initio* electronic structure data for a variety of chemical species. Among the many species for which we have developed PEFs, we have chosen to focus on the relatively simple and well-behaved case of sulfate for this manuscript, setting aside the more challenging systems, including dihaloethanes and small-molecule–water dimers. Note that while these other cases represent more difficult sampling and fitting problems, the features discussed here for sulfate are still relevant, as they illustrate a number of general aspects of the overall procedure and features of the resulting fits.

Regarding the fitting procedure as a whole, one can (naïvely) consider the number of terms comprising the PIP as a reflection of the difficulty and complexity of the problem, as this partly determines the dimensionality of the linear regression problem, along with the adequate size of the training set. A summary of the total number of symmetrized monomial terms comprising the PIP PEFs for a variety of systems is presented in Table I. Note the quasi-exponential growth in the number of terms with the order of the polynomial, and the dramatic effect that the presence of chemically equivalent, permutable atoms has on the number of terms, owing to the symmetrization procedure. As a particular example, without symmetrization the PIPs of order 1 through 4 for ethane would consist of 28, 434, 4494, and 35 959 terms, respectively. With symmetrization, the corresponding PIPs instead contain only 3, 15, 59, and 227 terms. To some extent, the number of nonlinear parameters can be seen to “set the tone” for the growth in the overall number of terms with increasing order, with the number of linear parameters in a first-order polynomial necessarily equal to the number of nonlinear parameters. (Note that a first-order polynomial effectively yields a Born–Mayer-type or “TTM-nrg” potential.^{53,54}) For a system of *n* atoms, the upper bound on the number of nonlinear parameters is *n* choose 2, corresponding to the case where all *n* atoms are considered chemically distinct and do not permute with one another, with the number of monomial terms given by the sum of multinomial coefficients,

The lower bound on the number of nonlinear parameters is 1, corresponding to the case where all atoms are equivalent/permutable (e.g., a homocyclic compound, such as *cyclo*-S_{8}).
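The unsymmetrized counts quoted above for ethane can be reproduced from the closed-form sum of multinomial coefficients; this sketch does not attempt the symmetrization step, which merges monomials related by the permutation group and yields the much smaller counts reported in Table I:

```python
from math import comb

def n_monomials(n_atoms, k):
    # Total number of monomials of degree 1 through k in the
    # m = C(n_atoms, 2) Morse-type variables (one per interatomic
    # distance), before symmetrization:
    #   sum_{j=1}^{k} C(m + j - 1, j) = C(m + k, k) - 1.
    m = comb(n_atoms, 2)
    return comb(m + k, k) - 1
```

For ethane (8 atoms, m = 28 distances), this reproduces the 28, 434, 4494, and 35 959 terms quoted in the text for orders 1 through 4.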

(The *k* = 1 through *k* = 6 columns give the total number of linear parameters for a PIP of order *k*.)

| Species | Permutations | Nonlinear parameters | *k* = 1 | *k* = 2 | *k* = 3 | *k* = 4 | *k* = 5 | *k* = 6 |
|---|---|---|---|---|---|---|---|---|
| H_{2}, N_{2}, O_{2} | 1 | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
| CO, CN^{−} | 1 | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
| CO_{2}, H_{2}O | 2 | 2 | 2 | 6 | 12 | 21 | 33 | 49 |
| CO_{3}^{2−}, NO_{3}^{−}, NH_{3}, PH_{3} | 6 | 2 | 2 | 8 | 22 | 50 | 102 | 195 |
| SO_{4}^{2−}, BF_{4}^{−}, NH_{4}^{+} | 24 | 2 | 2 | 9 | 29 | 82 | 207 | 494 |
| PF_{6}^{−} | 720 | 2 | 2 | 9 | 32 | 108 | 353 | 1 135 |
| CH_{2}O (formaldehyde), HCOO^{−} (formate) | 2 | 4 | 4 | 17 | 49 | 119 | 255 | 501 |
| HCOF (formyl fluoride) | 1 | 6 | 6 | 27 | 83 | 209 | 461 | 923 |
| HCOOH (formic acid)^{a} | 4 | 5 | 5 | 27 | 101 | 322 | 903 | 2 303 |
| HCOOH (formic acid)^{b} | 2 | 7 | 7 | 41 | 167 | 560 | 1 631 | 4 263 |
| C_{2}H_{2} (acetylene) | 4 | 3 | 3 | 12 | 32 | 74 | 152 | 290 |
| C_{2}H_{4} (ethene) | 48 | 3 | 3 | 15 | 56 | 193 | 608 | 1 809 |
| C_{2}H_{6} (ethane) | 1 440 | 3 | 3 | 15 | 59 | 227 | 854 | 3 201 |
| C_{2}F_{2}H_{2}, C_{2}Cl_{2}H_{2} (1,2-dihaloethenes) | 8 | 6 | 6 | 39 | 180 | 723 | 2 574 | 8 343 |
| C_{2}F_{2}H_{4}, C_{2}Cl_{2}H_{4} (1,2-dihaloethanes) | 96 | 6 | 6 | 43 | 237 | 1 214 | 5 800 | 26 164 |
| C_{3}H_{8}O (propanol, ethyl methyl ether)^{a} | 241 920 | 5 | … | … | … | … | … | … |
| CH_{3}CH_{2}CH_{2}OH (1-propanol)^{b} | 24 | 31 | 31 | 622 | 9 321 | 116 997 | … | … |
| CH_{3}CHOHCH_{3} (2-propanol)^{b} | 36 | 30 | 30 | 568 | 8 039 | 95 964 | … | … |
| CH_{3}CH_{2}OCH_{3} (ethyl methyl ether)^{b} | 72 | 72 | 404 | 5255 | 58 976 | … | … | … |
| C_{4}H_{10} (n-butane, isobutane)^{a} | 87 091 200 | 3 | … | … | … | … | … | … |
| C_{4}H_{10} (n-butane)^{b} | 69 120 | 10 | 10 | 105 | 900 | 7 307 | 57 167 | … |
| HC(CH_{3})_{3} (isobutane)^{b} | 2 177 280 | 8 | 8 | 66 | 444 | 2 772 | … | … |
| S_{6} (cyclo-hexasulfur) | 720 | 1 | 1 | 4 | 12 | 33 | 85 | 217 |
| S_{7} (cyclo-heptasulfur) | 5 040 | 1 | 1 | 4 | 12 | 34 | 94 | 267 |
| S_{8} (cyclo-octasulfur) | 40 320 | 1 | 1 | 4 | 12 | 35 | 99 | 296 |
| C_{4}H_{4}O (furan)^{a} | 576 | 5 | 5 | 35 | 197 | 1 048 | 5 323 | 26 149 |
| C_{4}H_{4}N_{2} (pyrimidine)^{a} | 1 152 | 6 | 6 | 47 | 296 | 1 783 | 10 307 | 57 861 |
| C_{5}H_{4}N_{4} (purine)^{a} | 69 120 | 6 | 6 | 51 | 360 | 2 490 | 16 981 | … |
| C_{6}H_{6} (benzene) | 518 400 | 3 | 3 | 18 | 88 | 428 | 2 099 | … |
| C_{3}H_{7}ON (N-methylacetamide)^{a} | 30 240 | 8 | 8 | 66 | 444 | 2 771 | 16 618 | … |
| C_{3}H_{7}ON (N-methylacetamide)^{b} | 1 | 30 | 30 | 568 | 8 039 | 95 964 | … | … |
| C_{5}H_{9}N_{2}^{+} (“C1mim”)^{a} | 87 091 200 | 6 | … | … | … | … | … | … |
| C_{5}H_{9}N_{2}^{+} (“C1mim”)^{b} | 11 520 | 26 | 26 | 500 | 7 521 | … | … | … |
| C_{6}H_{11}N_{2}^{+} (“C2mim”)^{a} | 57 480 192 000 | 6 | … | … | … | … | … | … |
| C_{6}H_{11}N_{2}^{+} (“C2mim”)^{b} | 72 | 94 | 94 | 4863 | 178 057 | … | … | … |


Two modifications to the polynomials [Eq. (5)] can be made, which lead to intermediate numbers of monomial terms for a given value of *k*. The first is to impose an additional restriction on the exponents of one or more of the Morse-type variables as suggested by Xie and Bowman,^{46} resulting in fewer monomial terms, e.g., developing a fourth-order PIP PEF for acetone [(CH_{3})_{2}CO] but restricting the exponent on the C=O Morse-type variable to be 3 or less. The second is the introduction of fictitious sites which enter Eq. (5) as pseudoatoms, which may or may not be allowed to permute with one another, as has been done in the MB-pol and MB-nrg potentials, among others. The presence of these pseudoatoms increases both the number of nonlinear and the number of linear parameters in the same manner as true atoms. While we do not consider such cases in this work, our findings so far indicate that if there is anything to be gained or lost from either of the two approaches, it is due solely to the change in the number of linear parameters and is independent of the details of how either approach is carried out.
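The effect of an exponent restriction of the first kind can be illustrated by brute-force enumeration of exponent tuples. This is a sketch for small polynomials only, not the generation scheme used in practice, and the `caps` dictionary is a hypothetical interface for per-variable exponent limits:

```python
from itertools import product

def monomial_exponents(n_vars, k, caps=None):
    # All exponent tuples of total degree 1..k in n_vars Morse-type
    # variables; `caps` maps a variable index to its maximum allowed
    # exponent, in the spirit of the Xie-Bowman restriction.
    caps = caps or {}
    keep = []
    for e in product(range(k + 1), repeat=n_vars):
        if 0 < sum(e) <= k and all(e[i] <= c for i, c in caps.items()):
            keep.append(e)
    return keep
```

For example, in 2 variables at order 2 there are 5 unrestricted monomials (y₁, y₂, y₁², y₁y₂, y₂²); capping the first exponent at 1 removes y₁², leaving 4.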

A more fundamental consideration with more substantial implications would appear to be the choice of permutable atoms for the model PEF. Formally, the potential energy is of course invariant to permutations of like atoms (atoms having the same atomic number *Z*). For a confined system consisting of *N* identical atoms, there are then *N*! degenerate minima of the PES for each geometrically distinct locally optimal configuration. However, the form of the PIP PEF, with interatomic distances being used as coordinates, suggests at least on an intuitive level that the connectivity, or local environment, of each atom should perhaps also be considered when building a degree of permutational invariance into the model PEF. As an example, one can consider *n*-butane, for which a proper PIP PEF would satisfy the permutational invariance of all carbon and all hydrogen atoms with one another and consist of only 3 nonlinear parameters (*k*_{CC}, *k*_{CH}, and *k*_{HH}). Still, it may appear advantageous, from the vantage point of generating an accurate nonreactive PEF, to treat the terminal carbons as being distinct from the central carbons, as well as their respectively bound hydrogen atoms. The resulting PEF would then possess 10 nonlinear parameters ($k_{\mathrm{C_{central}C_{central}}}$, $k_{\mathrm{C_{central}C_{terminal}}}$, …, $k_{\mathrm{H_{terminal}H_{terminal}}}$). One would then hope that a higher level of accuracy could be achieved at the expense of generating PIPs having more complex parameterizations (more nonlinear parameters), in a way that does not depend trivially on the number of linear parameters. We do not attempt to resolve this issue in this work, but have included the number of terms present in each scenario for a number of molecules in Table I.
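The nonlinear-parameter counts quoted for the two *n*-butane typings follow from counting unordered pairs of atom types, where a same-type pair contributes only if at least two atoms of that type are present. A small sketch (the type labels are illustrative):

```python
from collections import Counter
from itertools import combinations_with_replacement

def n_nonlinear_params(atoms):
    # atoms: list of atom-type labels, e.g. ["C", "C", "H", ...].
    # One nonlinear (range) parameter per unordered pair of types that
    # actually occurs as an interatomic distance.
    counts = Counter(atoms)
    n = 0
    for a, b in combinations_with_replacement(sorted(counts), 2):
        if a == b:
            n += counts[a] >= 2  # same-type pair needs two such atoms
        else:
            n += 1
    return n
```

This reproduces, e.g., 2 parameters for H₂O (*k*_{HH}, *k*_{OH}) and for SO₄²⁻ (*k*_{SO}, *k*_{OO}), 6 for HCOF (all four atoms distinct), and 3 vs 10 for the two *n*-butane typings discussed above.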

In the context of the sampling approach proposed here, perhaps the most important property of each compound is its characteristic “dynamical range,” that is, the distribution of eigenfrequencies which determines the relative breadth of the distribution in each dimension. This range can additionally be taken as an indication of the anharmonicity or “floppiness” of the system, where large-amplitude, low-frequency modes are understood to be more anharmonic in nature. In addition, there is the more general challenge of efficiently sampling high-dimensional configuration space as a whole and, in particular, of sampling an adequate number of local minima and stationary points. In this regard, sulfate serves as a well-behaved case study, possessing a single minimum-energy configuration to be sampled about and exhibiting a relatively narrow range of eigenfrequencies, with *ω*_{max}/*ω*_{min} = 2.5 for the *ω*B97M-V/apc-2-optimized structure. For perspective, this ratio is an order of magnitude larger for the anti conformer of 1,2-difluoroethane (*ω*_{max}/*ω*_{min} = 28.3), the local and global minima of N_{2}O_{5} (*ω*_{max}/*ω*_{min} = 61.6 and 87.5, respectively), and the sulfate–water dimer (*ω*_{max}/*ω*_{min} = 88.8), and two orders of magnitude larger for the fluoride–water trimer, F^{−}(H_{2}O)_{2} (*ω*_{max}/*ω*_{min} = 200.9).

### A. Training sets: Impact of “temperature” (breadth of sampling)

Given that the obvious goal of the fitting procedure is to accurately reproduce energies with respect to some reference (e.g., CCSD(T), *ω*B97M-V/apc-2), one of the more relevant attributes of the training set would appear to be the distribution of energies being sampled, including how this distribution compares with the distribution of energies expected to arise in practice, particularly in a typical MC or MD simulation. More precisely, one would like to be able to characterize the differences between the configurations generated by sampling the harmonic distribution, and its *ad hoc* fixed-*A* variation, and those encountered in an MC or MD simulation, but we settle for considering the energies as a simple proxy. A training set comprising only configurations extracted from a typical simulation (e.g., NPT-ensemble MD at ambient pressure and temperature) may at first glance seem desirable, as it can be expected to produce more accurate energies in “conventional” use. However, this approach is very likely to come at the steep price of compromising the overall stability of the resulting fit. We reiterate that the end goal in the development of the MB-nrg PEFs is to accurately describe the system over a broad range of physical conditions, across all phases. In particular, the one-body PEF must be able to accurately describe distorted configurations that may be adopted in the condensed phase and, ideally, configurations approaching those corresponding to chemical events such as hydrogen transfer. [Note that while PIPs have been used to generate reactive potentials (see, for example, Refs. 55–58), the potentials described here are not intended to be reactive.]
In addition, it should be possible to reliably combine the PEFs with methods that require accurate, or at least reasonable, energies for high-energy or potentially “unphysical” configurations, such as those encountered during MC initialization and equilibration or when using grid-based methods in a variety of contexts, e.g., for solving the vibrational Schrödinger equation. It would seem clear, then, that an ideal “complete” training set would possess a ratio of highly distorted, high-energy configurations to low-energy configurations which balances the need for stability with the need for accuracy, which is of course dependent upon the intended applications of the PEF and scientific questions to be addressed.

To this end, a collection of training sets and corresponding fits for sulfate were generated by sampling the distributions described in Sec. II C at various temperatures [Eqs. (10), (12), and (13)] and for various ratios *A* ≔ *k*_{B}*T*_{i}/(*ℏω*_{i}) [Eqs. (10), (12), and (14)]. Each training and test set consists of 2048 configurations. A Sobol sequence was used to generate the training sets, and a pseudorandom sequence was used to generate the test sets. The *ω*B97M-V functional^{59} with the apc-2 basis set^{60–62} was chosen as a reference. All electronic structure calculations were performed using Q-Chem 4.4.^{63} As discussed in Sec. II C, we have chosen to use the eigenfrequencies as a natural guide for selecting temperature ranges. For example, the highest-frequency normal modes of the *ω*B97M-V/apc-2-optimized geometry of sulfate have frequencies of ≈1077 cm^{−1}, corresponding to a temperature of ∼1550 K, while the lowest- (nonzero) frequency normal modes have frequencies of 426 cm^{−1}, corresponding to a temperature of ∼613 K. The resulting energy distributions are shown in Fig. 3 with respect to the global minimum. Because we have chosen to perform our sampling using the quantum partition function, the low-temperature distributions are not peaked near the minimum of the Born–Oppenheimer surface, but rather nearer to the zero-point energy (9.754 kcal/mol). In contrast to the previously reported MB-pol and MB-nrg potentials, here we have chosen to forgo any type of weighting when performing the regression for one-body PEFs, which could be used to increase the contribution of the low-energy configurations to the cost function, thereby increasing their influence over the resulting fits.
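The frequency-to-temperature correspondences quoted above follow directly from Θ_{E} = ℏω/*k*_{B}; for a frequency given in wavenumbers, this amounts to multiplication by the second radiation constant *hc*/*k*_{B} ≈ 1.4388 cm K:

```python
def einstein_temperature(nu_cm):
    # Theta_E = hbar * omega / kB.  For a wavenumber nu (cm^-1), this is
    # c2 * nu, where c2 = h * c / kB ~ 1.438777 cm K (CODATA second
    # radiation constant).
    C2_CM_K = 1.438777
    return C2_CM_K * nu_cm
```

For sulfate's ≈1077 cm⁻¹ and 426 cm⁻¹ modes, this gives the ∼1550 K and ∼613 K values used to bracket the sampling temperatures.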

To gain a quantitative picture of how “transferable” the fits resulting from training sets generated at different values of *T* and *A* are, each of the generated fits was applied to all test sets generated for the same collection of *T* and *A* values. The resulting root-mean-square deviations (RMSDs), mean absolute errors (MAEs), and maximum errors were then evaluated. RMSDs for third- and fourth-order sulfate fits are given in Figs. 4 and 5, respectively, with darker-colored cells indicating larger RMSD values. While the RMSD and MAE yield values that can be readily understood in a physical sense, one must keep in mind that these values reflect not only the performance of the fit but also the distribution of energies of the test set. As such, any quantitative head-to-head comparisons of RMSD or MAE values should be made only with respect to the same test set (i.e., only across a row in Fig. 4 or Fig. 5). That the maximum error is determined entirely by a single configuration generally makes it a poor metric for comparison in this context, especially considering the existence of multiple solutions (discussed in detail in Secs. III B and III C) which may perform similarly well with respect to a sizable test set, but very dissimilarly for any particular configuration. As such, only RMSD results are reported here, though the results for the remaining metrics reveal the same general trends to at least some extent.
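For reference, the three error metrics compared here are simple functionals of the fit residuals; a minimal implementation:

```python
import numpy as np

def error_metrics(E_fit, E_ref):
    # RMSD and MAE average over the whole test set, while the maximum
    # error is set by a single worst-case configuration.
    r = np.asarray(E_fit, dtype=float) - np.asarray(E_ref, dtype=float)
    return {"rmsd": float(np.sqrt(np.mean(r ** 2))),
            "mae": float(np.mean(np.abs(r))),
            "max": float(np.max(np.abs(r)))}
```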

First and foremost, it must be recognized that *all* of the third- and fourth-order fits for this system perform excellently. All RMSDs and MAEs (not shown) are well within 1 kcal/mol, and even the maximum-error values (not shown) exceed 1 kcal/mol only for configurations whose reference distortion energies are well over 50 kcal/mol. Nevertheless, some fits perform better than others overall. The most obvious and easily anticipated observation is that fits trained on the less distorted, lower-energy configurations are not ideally suited for predicting energies of highly distorted, high-energy configurations, and vice versa. Even so, it can be argued that the improved accuracy in describing higher-energy configurations gained when training on high-energy configurations is well worth the reduced accuracy in describing low-energy configurations. For example, although the RMSD for the *T* = 0 K test set increases 3.7-fold when the training set temperature is increased from 0 K to 1550 K (row 1 of Fig. 4), a significant improvement in the RMSDs of the *T* = 1000 through 3100 K test sets is achieved (rows 4 through 8 of Fig. 4). The trade-off is even more palatable for the more flexible fourth-order polynomials, where, for example, the RMSD for the *T* = 0 K test set increases 5.2-fold when the training set temperature is increased from 0 K to 3100 K (row 1 of Fig. 5), yet the RMSDs for the *T* = 1000 through 3100 K test sets are appreciably improved (rows 4 through 8 of Fig. 5), as are those for the *A* = 0.5 through 2, “Geometric,” and “Linear” test sets (rows 9 through 13 of Fig. 5). It seems advisable, then, to sample over a range of temperatures when generating a training set, as PIPs of high enough order are sufficiently flexible and complex to describe both low- and high-energy configurations well.

### B. Sensitivity of the solution space with respect to training set “temperature” (breadth of sampling)

Figures 4 and 5 show clearly the consistent manner in which the fit RMSDs for a given test set vary smoothly and commensurately with respect to the temperature at which the fit was trained. This smooth and consistent trend suggests a systematic drift in the solutions obtained depending on the breadth of sampling of the training set, which we demonstrate in Fig. 6 for polynomials of order *k* = 1 through 6. Recall that for each set of nonlinear parameters, the linear parameters are uniquely determined, so that the solution/parameterization of the fit is uniquely described by the values of the nonlinear parameters. Note that regardless of temperature, for *k* = 2, 3, and 4, two solutions are found, confined to two respective neighborhoods. This overall confinement of the solutions and their gradual, systematic change with training set sampling breadth suggests that regardless of the temperature used to generate the training set, the resulting solution(s) should be rather robust and perform reasonably well for test set configurations generated at different temperatures, which is corroborated by Figs. 4 and 5. In particular, for *k* = 2 and 3, we find the solutions of one neighborhood, corresponding to smaller *k*_{OO} values, to be more localized. It is tempting to interpret the solutions in these neighborhoods as exhibiting significantly less sensitivity to the temperature at which the training set was generated, which would suggest that these solutions are more stable and robust, i.e., better able to provide accurate energies for a more diverse sample of “unfamiliar” configurations. However, this is not necessarily the case: because the *y*_{OO} Morse-type variables exhibit less variation with respect to the *r*_{OO} distances for smaller values of *k*_{OO}, the *k*_{SO} values need not change much in order to balance the shifting *k*_{OO} values in these neighborhoods.

Finally, while *k*_{SO} and *k*_{OO} do act as (inverse) range parameters in the PIPs [Eq. (5)], Fig. 6 illustrates that they should not be imbued with any physical meaning for PIPs of order greater than 1. Likewise, it does not seem possible to reliably predict the value of one from the other. For example, for *k* = 2, 3, and 4, different trends are observed in the simultaneous increase/decrease of *k*_{SO} and *k*_{OO} with training set temperature. In particular, for the *k* = 3 solution neighborhood at larger *k*_{OO}, the value of *k*_{SO} decreases substantially with increasing temperature, indicating an increase in the range of the four sulfur–oxygen Morse-type variables, while the value of *k*_{OO} simultaneously undergoes a modest decrease, indicating a small increase in the range of the six oxygen–oxygen Morse-type variables. For the solution neighborhood at smaller *k*_{OO}, the value of *k*_{SO} instead increases, with *k*_{OO} simultaneously undergoing a modest increase. In short, the form of the PIPs is clearly such that the nonlinear parameters may compensate for one another in such a manner that they should not be considered to have a reliable physical meaning independently of one another. For *k* = 5 and 6, it is apparent that the number of linear parameters is sufficiently large to result in converged, low-RMSD solutions for a broad range of nonlinear parameter values, rendering attempts at interpreting the values and behavior of *k*_{SO} and *k*_{OO} especially futile. At the same time, it is interesting, though not surprising, to note that the more diverse and broadly sampled training sets result in a smaller number of distinct solutions which are more localized. For *k* = 1, by contrast, the distinct solutions tend to be more extreme, with one nonlinear parameter effectively taking on the value of zero.
When this is the case, the corresponding linear parameter/coefficient also takes on an extreme value, which is many orders of magnitude larger than the other.

Although it is not relevant in the case of sulfate, we add briefly that the existence of multiple solutions invites the possibility of combining solutions into a single PEF, which may be worthwhile for more complicated systems and scenarios, especially where different solutions best represent qualitatively different subsets of the training set. One can consider the reactive, global PEF for formaldehyde of Ref. 64 as an example.

### C. Survey of the solution space

As a more readily interpreted variation of the cost function surface, or loss surface, we consider the RMSD surfaces for polynomials of order 1 through 6 in Fig. 7. A training set consisting of 8192 configurations was used, generated using a linear progression of temperature and *A* ≔ *k*_{B}*T*_{i}/*ℏω*_{i} values, as described in the caption of Fig. 3. With the linear parameters uniquely determined for each pair of nonlinear parameters (*k*_{SO}, *k*_{OO}), the surfaces can be calculated and plotted unambiguously for a given training set. For each order of polynomial, 10^{4} randomly chosen (*k*_{SO}, *k*_{OO}) pairs were generated on [0, 3]^{2} and the corresponding linear parameters and RMSD for the training set computed. This interval was chosen following preliminary calculations with initial guesses for the nonlinear parameters being generated on [0, 25]^{2}, and finding that the fitting procedure reliably converged to fits with (*k*_{SO}, *k*_{OO}) ∈ [0, 3]^{2} for polynomials of order 2 or greater.
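The machinery behind these surfaces (linear parameters solved by least squares for each trial pair of nonlinear parameters, with a derivative-free search over the nonlinear parameters alone) can be sketched generically as follows. This is a minimal, separable (variable-projection-style) least-squares sketch; the toy one-variable diatomic basis stands in for Eq. (5), and all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def fit_pef(basis, X, E, kappa0):
    # For any fixed nonlinear parameters kappa, the optimal linear
    # coefficients solve an ordinary least-squares problem, so the outer
    # Nelder-Mead search runs over kappa alone.
    E = np.asarray(E, dtype=float)

    def rmsd(kappa):
        B = basis(X, kappa)                        # design matrix (n_cfg, n_terms)
        c, *_ = np.linalg.lstsq(B, E, rcond=None)  # linear parameters
        return float(np.sqrt(np.mean((B @ c - E) ** 2)))

    res = minimize(rmsd, np.asarray(kappa0, dtype=float),
                   method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-12})
    B = basis(X, res.x)
    c, *_ = np.linalg.lstsq(B, E, rcond=None)
    return res.x, c, res.fun

def diatomic_basis(r, kappa):
    # Toy "PIP": monomials y, y^2 of a single Morse-type variable
    # y = exp(-kappa * r).
    y = np.exp(-kappa[0] * np.asarray(r))
    return np.column_stack([y, y ** 2])
```

Scanning `rmsd` over a grid of nonlinear-parameter pairs, rather than minimizing it, yields surfaces of the kind shown in Fig. 7.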

The most striking feature of the RMSD surface is the existence of two distinct minima for lower-order polynomials, which gradually merge into a single broad and shallow minimum for higher-order polynomials. The barrier at $k_{\mathrm{OO}} \sim \frac{2}{3}\,k_{\mathrm{SO}}$ which divides the two solutions appears to be a direct consequence of the form of the polynomials and the ratio of mean oxygen–oxygen to mean sulfur–oxygen distances over the training set, $\bar{r}_{\mathrm{OO}}/\bar{r}_{\mathrm{SO}} \approx 1.6$. When $k_{\mathrm{OO}} \approx (\bar{r}_{\mathrm{SO}}/\bar{r}_{\mathrm{OO}})\,k_{\mathrm{SO}} = k_{\mathrm{SO}}/1.6$, each $y_{\mathrm{O}_i\mathrm{O}_j} \approx \exp\{-k_{\mathrm{SO}}\,\bar{r}_{\mathrm{SO}}\}$ is roughly constant, and the polynomial is effectively reduced to having a single nonlinear parameter and monomials which roughly reduce to constants. (This is corroborated empirically by RMSD surfaces computed for nitrate, which can be found in the supplementary material.) That there are effectively no restrictions on the values of *k*_{SO} and *k*_{OO} for the fifth- and sixth-order fits suggests that these polynomials contain more terms than are needed to accurately convey the information contained in the training set, and that *k*_{SO} and *k*_{OO} have no special meaning for these higher-order polynomials, physical or otherwise. In other words, for virtually any (*k*_{SO}, *k*_{OO}) in [0, 3]^{2}, there are enough terms present to obtain a fit having very low RMSD with respect to the training set, but which will more likely suffer from overfitting. Although we do not attempt a demonstration, it is expected that the 8192 configurations have adequately sampled the distribution, so that using a larger training set following this same distribution would not introduce sufficient additional information to affect this.
In other words, one can reasonably suppose that the higher-order (fifth- and sixth-order) models are, in a sense, more complex than what is needed to represent this distribution of configurations, making some degree of overfitting inevitable, i.e., the information available in this distribution that can be captured by the model has been exhausted. Should additional training set configurations be added from a *different* distribution, particularly one having a large Kullback–Leibler divergence with respect to the distribution used in the text, then one could more readily expect a greater sensitivity of the RMSD to the (*k*_{SO}, *k*_{OO}) values for the fifth- and sixth-order fits, and a reduced tendency toward overfitting.
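The proposed origin of the barrier admits a quick numerical check, assuming it lies where the O–O Morse-type variables become degenerate with the S–O ones, i.e., *k*_{OO} ≈ *k*_{SO}/1.6 for near-tetrahedral geometries; the distance range and *k*_{SO} value below are hypothetical:

```python
import numpy as np

# For an ideal tetrahedron, r_OO = sqrt(8/3) * r_SO ~ 1.633 r_SO.  With
# k_OO = k_SO / 1.6, the O-O Morse variable y_OO = exp(-k_OO * r_OO)
# nearly coincides with y_SO = exp(-k_SO * r_SO), so monomials in y_OO
# carry almost no information beyond those already in y_SO.
k_so = 1.5                                  # hypothetical, Angstrom^-1
r_so = np.linspace(1.35, 1.65, 7)           # hypothetical S-O distances, Angstrom
r_oo = np.sqrt(8.0 / 3.0) * r_so            # tetrahedral O-O distances
y_so = np.exp(-k_so * r_so)
y_oo = np.exp(-(k_so / 1.6) * r_oo)
rel_dev = np.abs(y_oo / y_so - 1.0)         # stays within a few percent
```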

From a practical standpoint, the RMSD surfaces for this system appear quite reassuring, overall: although there could in principle be many local minima on the cost function surface, and the solution obtained using the Nelder–Mead simplex optimization is dependent on the initial guess of nonlinear parameters, we find that for this rather simple system and PIP, the optimal solution can be easily found, i.e., only a very small and very manageable number of initial guesses must be made in order to locate the minimum-RMSD solution. For a more general case concerning *n* nonlinear parameters, *k*_{1}, …, *k*_{n}, one might anticipate barriers in the RMSD surfaces at $k_i \sim (\bar{r}_J/\bar{r}_I)\,k_j$, where $\bar{r}_I$ and $\bar{r}_J$ are the mean distances associated with the pairs described by $k_i$ and $k_j$, separating local minima. If this is indeed the case, then the maximum number of possible solutions on some large-enough interval [0, *k*_{1,max}] ×⋯× [0, *k*_{n,max}] could be predicted ($2^{\frac{1}{2}n(n-1)}$), as well as regions where initial guesses for the nonlinear parameters should be included in order to precisely locate all possible solutions. At the same time, one might also expect that the heights of these barriers should decrease dramatically as the number of nonlinear parameters increases, so that for even a modest number of nonlinear parameters, such barriers in the loss surface are effectively nonexistent. In short, with respect to the complexity of the loss surface, it appears that the problem of one-body PES fitting using PIPs should be very tractable even for a rather large number of nonlinear parameters.

The practical outcome of the fitting procedure is presented in Table II, where we list the distinct solutions obtained for polynomials of order 1 through 6 and the frequency with which they were obtained when performing the full fitting procedure using 100 initial guesses for (*k*_{SO}, *k*_{OO}) on [0, 3]^{2}. A tolerance of 10^{−8} Å^{−1} on each nonlinear parameter was used in the simplex optimization as a threshold for convergence, and initial simplex sides of length 0.1 Å^{−1} were used. Based on the relative areas on either side of the boundary and uniform (random) initialization of (*k*_{SO}, *k*_{OO}) on [0, 3]^{2}, one might expect the Nelder–Mead simplex optimization to converge to the $k_{\mathrm{OO}} > \frac{2}{3}\,k_{\mathrm{SO}}$ minimum about twice as often as the $k_{\mathrm{OO}} < \frac{2}{3}\,k_{\mathrm{SO}}$ minimum. This is indeed the case for *k* = 2, where the barrier is well-defined on the whole of [0, 3]^{2}, but convergence to the global minimum becomes more frequent as the order of the polynomial increases and the barrier dissipates. The general trend would then seem to be that as the number of polynomial terms is increased, the $k_i = (\bar{r}_J/\bar{r}_I)\,k_j$ boundaries dissipate, facilitating the convergence of the simplex algorithm to lower-lying minima.
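The roughly 2:1 expectation for uniform initialization can be checked directly, assuming the dividing boundary is the line *k*_{OO} = (2/3) *k*_{SO} discussed above:

```python
import numpy as np

# Area of [0, 3]^2 above the line k_OO = (2/3) k_SO is 9 - 3 = 6, so a
# uniformly drawn initial guess starts above the boundary with
# probability 2/3, i.e., about twice as often as below it.
rng = np.random.default_rng(1)
k_so, k_oo = rng.uniform(0.0, 3.0, size=(2, 200_000))
frac_above = float(np.mean(k_oo > (2.0 / 3.0) * k_so))
```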

| Polynomial order | Training set RMSD (kcal/mol) | Test set RMSD (kcal/mol) | Training set max error (kcal/mol) | Test set max error (kcal/mol) | *k*_{SO} (Å^{−1}) | *k*_{OO} (Å^{−1}) | Frequency |
|---|---|---|---|---|---|---|---|
| 1 | 6.98 | 7.02 | 50.6 | 74.2 | 19.15 | 8.02 | 71 |
| 1 | 7.16 | 7.08 | 73.7 | 88.8 | 18.06 | ∼0 | 2 |
| 1 | 7.42 | 7.48 | 61.8 | 86.5 | 1.1 × 10^{−7} | 7 × 10^{−8} | 20 |
| 1 | 7.52 | 7.48 | 68.3 | 66.8 | ∼0 | 9.82 | 6 |
| 2 | 3.41 × 10^{−1} | 3.00 × 10^{−1} | 7.61 | 4.70 | 2.02 | 0.62 | 27 |
| 2 | 8.12 × 10^{−1} | 7.61 × 10^{−1} | 21.4 | 13.2 | 1.60 | 2.04 | 72 |
| 3 | 4.55 × 10^{−2} | 4.70 × 10^{−2} | 9.61 × 10^{−1} | 1.00 | 1.60 | 0.66 | 42 |
| 3 | 5.36 × 10^{−2} | 5.50 × 10^{−2} | 9.05 × 10^{−1} | 1.37 | 1.43 | 1.21 | 57 |
| 4 | 3.97 × 10^{−3} | 5.36 × 10^{−3} | 8.52 × 10^{−2} | 2.07 × 10^{−1} | 0.95 | 0.51 | 70 |
| 4 | 4.07 × 10^{−3} | 5.55 × 10^{−3} | 8.45 × 10^{−2} | 2.02 × 10^{−1} | 0.95 | 0.69 | 29 |
| 5 | 8.27 × 10^{−4} | 1.61 × 10^{−3} | 9.23 × 10^{−3} | 8.87 × 10^{−2} | 0.71 | 0.52 | 93 |
| 5 | ≥8.31 × 10^{−3} | … | ≥1.40 × 10^{−1} | … | ∈ [0.0985, 2.01] | ∈ [2.43, 3.06] | 7 |
| 6 | 5.92 × 10^{−4} | 1.90 × 10^{−3} | 3.03 × 10^{−3} | 1.05 × 10^{−1} | 1.54 | 0.31 | 38 |
| 6 | ≥5.92 × 10^{−4} | … | ≥2.80 × 10^{−3} | … | ∈ [0.126, 2.72] | ∈ [0.244, 2.98] | 62 |


Although the number of linear parameters comprising the PIP grows dramatically with the order of the polynomial, the improvement in the training and test set RMSDs tapers off sharply with increasing *k*. This is depicted clearly in Fig. 8, which shows the training and test set RMSDs for the best fits listed in Table II. Note also the clear indication of overfitting in the sixth-order PIP fit, which has a greater test set RMSD than the fifth-order PIP fit (1.90 × 10^{−3} kcal/mol vs 1.61 × 10^{−3} kcal/mol). Independent of overfitting considerations, the plateauing performance of the fits with respect to the number of linear parameters suggests the existence of inherent limitations to the level of accuracy that can be achieved with PIP-based PEFs, an issue which will be discussed subsequently.

### D. Training sets: Impact of underlying sequence on resulting fits

Given the general need to efficiently sample high-dimensional configuration spaces, we have compared the use of both pseudorandom and quasirandom sequences in the generation of our training sets, with the expectation that the quasirandom sequences should lead to converged fits using fewer training set configurations. In particular, we have employed Sobol’s sequence,^{65,66} which has been found to perform well in comparison with Faure, Halton, and Niederreiter sequences in the context of quasi-Monte Carlo integration.^{67–69} We add that this is something of an intuition-based expectation held in spite of the fact that low-discrepancy sequences have been developed for the express purpose of Monte Carlo-based numerical integration (quasi-Monte Carlo) and not for polynomial fitting or interpolation. In fact, there are several conceivable disadvantages to the use of quasirandom sequences. First, pseudorandom sequences are more convenient to work with, owing to the overall structure and deterministic nature of the low-discrepancy sequences. For instance, Sobol’s sequence fills/samples the unit hypercube most uniformly when the length of the sequence is 2^{n}, $n \in \mathbb{N}$, and when the first 2^{m} points, *m* ≤ *n*, are skipped.^{70} This feature should ideally be taken into consideration when generating a training set and should also elicit some hesitation before manipulating an existing training set (e.g., pruning or attempting cross-validation). Second, the deterministic nature of the low-discrepancy sequences, coupled with the previous point, leads to training/test sets of the same size (number of configurations) and dimension having highly correlated structures, i.e., sets generated at distinct temperatures *T*_{1} and *T*_{2} using the same (Sobol) sequence will have very similar, correlated relative displacements of atoms when comparing the ordered configurations one-to-one.
Third, the distribution of nearest-neighbor distances in a quasirandom sequence may be inferior to that of a pseudorandom sequence for the purposes of training set generation and PEF fitting. Fourth, multidimensional quasirandom sequences are known to potentially exhibit a pathologically high degree of correlation between/among different components (see, for example, Ref. 67). In short, we reiterate that low-discrepancy sequences were designed for the express purpose of Monte Carlo-based numerical integration and that their application to other purposes should respect this fact. Still, we expect that the more efficient space-filling properties of the quasirandom sequences should outweigh the disadvantages as the dimensionality of the problem increases.
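The bookkeeping details mentioned above — power-of-two sequence lengths and skipping an initial power-of-two block — can be illustrated with SciPy's quasi-Monte Carlo module. The sketch below is an assumption-laden example, not the actual training set generator: it draws Sobol points for a 9-dimensional space (the number of internal degrees of freedom of sulfate), alongside a pseudorandom set, and uses a standard discrepancy measure to quantify uniformity.

```python
import numpy as np
from scipy.stats import qmc

d, m = 9, 8                          # dimension; 2^m = 256 points
sobol = qmc.Sobol(d=d, scramble=False)
sobol.fast_forward(2 ** m)           # skip the first 2^m points of the sequence
q_points = sobol.random_base2(m)     # 256 quasirandom points in [0, 1)^9

rng = np.random.default_rng(0)
p_points = rng.random((2 ** m, d))   # pseudorandom points for comparison

# a low-discrepancy set should generally score lower (more uniform) here
print("Sobol CD: ", qmc.discrepancy(q_points, method="CD"))
print("random CD:", qmc.discrepancy(p_points, method="CD"))
```

Note that `random_base2` enforces power-of-two draws, so the skipped block is itself chosen as a power of two to keep the total balanced.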

To test the validity of this supposition, a collection of pseudorandom-sequence– and quasirandom-sequence–based training sets (denoted as “P” and “Q,” respectively) were generated for sulfate, ranging in size from 64 to 8192 configurations. For each training set and PIP of a given order (*k* = 1 through 6), a total of 100 fits were attempted, and the fit having the lowest RMSD with respect to the training set was identified and applied to the test set, which consists of 2048 configurations generated using a particular pseudorandom sequence. The RMSDs (in kcal/mol) of the best fits applied to the test set are reported in Fig. 9. Even for the relatively simple case of sulfate, the use of quasirandom sequences was *generally* found to lead to better performance of the resulting fits (lower RMSDs with respect to the test set) when a smaller training set is used, with the test set RMSD improving more consistently with increasing training set size. As one would expect, the performance of the fits resulting from the P and Q training sets increasingly coincides as the number of training set configurations is increased. Side-by-side comparison of the *k* = 4 and 5 results suggests considerable overfitting resulting from the 256- and 512-configuration training sets for *k* = 5, though the overfitting is much less severe when using the quasirandom sequence.

Note that the comparison of the P and Q solutions for a given training set size is in some cases complicated by the fact that the two solutions may actually correspond to two different minima of the overall solution space. For example, the best *k* = 1 solution for the 128-configuration P training set is (*k*_{SO}, *k*_{OO}) = (12.7 Å^{−1}, 16.6 Å^{−1}), while the best solution for the 128-configuration Q training set is (*k*_{SO}, *k*_{OO}) = (18.2 Å^{−1}, 0.0 Å^{−1}). This is, however, not typically the case, and the RMSD surfaces calculated using a pseudorandom-sequence–based training set are qualitatively the same as those calculated using the quasirandom-sequence–based training set. As an example, RMSD surfaces calculated using pseudorandom-sequence– and quasirandom-sequence–based training sets consisting of 512 configurations each can be found in the supplementary material, as well as RMSD surfaces calculated using a pseudorandom-sequence–based training set consisting of 8192 configurations, analogous to Fig. 7. For the 512-configuration training sets, the RMSD surfaces corresponding to the quasirandom-sequence–based training set have higher RMSDs overall, a reflection of being more difficult for the model PEF to describe due to broader and more complete sampling of the harmonic distribution. For the smaller 512-configuration training sets, one may also expect the RMSD surfaces corresponding to the quasirandom-sequence–based training set to be less shallow overall, with larger Hessian eigenvalues at the minima, though this cannot be objectively determined from Figs. S2 and S3 of the supplementary material. For the larger 8192-configuration training sets, there is no noteworthy difference between the RMSD surfaces resulting from the pseudorandom-sequence– and quasirandom-sequence–based training sets, as the more efficient sampling of the quasirandom sequence becomes less discernible from the sampling of the pseudorandom sequence with increasing sequence size.

The volatility of the *k* = 1 and 2 results reveals a surprising sensitivity of the solution to the training set for the simplest polynomial expressions. To reframe this finding, the data in Fig. 9 have been scaled to show the RMSD vs the number of training set configurations *per linear parameter* in Fig. 10. Note that in Fig. 10 some data points are located beyond the *x*-axis maximum in cases where the RMSDs for the “P” and “Q” results have effectively reached the same constant value. Surprisingly, we find that the less complex and more rigid the polynomial used (smaller *k*), the more training set configurations are needed per linear parameter in order to achieve a “converged” test set RMSD and a consistent solution. One may then think of each term of a lower-order PIP as containing more information, on average, than each term of a higher-order PIP, a notion which is supported by the results of Subsection III E. This would appear to be a direct consequence of the decreasing range of successive powers of the Morse-type variables and, by obvious extension, the decreasing range of the additional monomial terms with increasing order.

### E. Numerical limitations to accuracy (singular value decomposition, regularization)

The distributions of singular values corresponding to the converged, minimum-RMSD first- through sixth-order fits for sulfate are shown in Fig. 11. A training set consisting of 8192 configurations generated using a quasirandom sequence was used. It can be readily appreciated that (1) the ill-conditionedness of the problem, indicated by *σ*_{max}/*σ*_{min}, increases dramatically as the number of terms in the polynomial increases, (2) the ordered singular values decrease exponentially, and (3) there are no appreciable gaps between *σ*_{n} and *σ*_{n+1} in any of the plots, save perhaps for those seen at very small and very large *n*. Given that each polynomial basis of a given order is a subset of each higher-order polynomial basis [cf. Eq. (5)], one might expect the distributions of ordered, scaled singular values, *σ*_{n}/*σ*_{max}, to be roughly equivalent over the range of coexisting indices. Indeed, Fig. 12 shows this to be true.

The straightforward implication of observation (1) above is that it becomes objectively more difficult to obtain a stable numerical solution as the number of terms is increased. Combined with the observation that the distributions of scaled singular values are essentially superimposable, observation (2) indicates that a relatively small number of terms, the lower-order terms, are responsible for a relatively large contribution to the accurate representation of the PES. Finally, the implication of observation (3) is that there is no clear distinction between more meaningful and less meaningful (or “noisy”) contributions to the solution, and hence, no “natural” choice of regularization parameter.

Supposing we adopt an upper bound of 10^{8} on the condition number, we find that the number of singular values within this threshold is 28 of 29 for *k* = 3, 48 of 82 for *k* = 4, 55 of 207 for *k* = 5, and 59 of 494 for *k* = 6. This is to say that beyond *k* = 4, as the number of terms in the polynomial is increased, the majority of additional singular values will fall beyond the designated “noise” threshold. Note that due to the exponential decrease in the singular values, this argument holds more generally for other thresholds that one may designate. Moreover, the exponential decrease in the singular values obeyed for all *k* greatly outpaces the growth in the number of terms with increasing *k*, making even their aggregate contribution effectively negligible beyond some value of *k*. In addition, we find that the smallest scaled singular values for *k* = 6 are numerically effectively zero, indicating that *A* is effectively singular for this value of *k*. This suggests that the number of terms needed to represent the information contained in the training set has been exceeded, as we have seen in Secs. III B and III C. The use of quasirandom vs pseudorandom sequences was not found to affect the distribution of singular values upon convergence nor at the initial or intermediate stages of the fitting procedure, i.e., the ill-conditionedness of the problem was not impacted by the type of sequence used to generate the training set.
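The threshold counting just described amounts to a few lines of linear algebra. In the sketch below, the PIP design matrix is replaced by a synthetic matrix constructed to have an exponentially decaying singular spectrum, mimicking observation (2); only the counting logic is meant to carry over to the actual fitting problem.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic stand-in for the design matrix A: orthonormal factors combined
# with exponentially decaying singular values
U, _ = np.linalg.qr(rng.standard_normal((200, 80)))
V, _ = np.linalg.qr(rng.standard_normal((80, 80)))
s = np.logspace(0, -12, 80)
A = U @ np.diag(s) @ V.T

sigma = np.linalg.svd(A, compute_uv=False)        # sorted in descending order
threshold = 1e8                                   # bound on sigma_max / sigma_n
n_kept = int(np.sum(sigma >= sigma[0] / threshold))
print(n_kept, "of", sigma.size, "singular values within the threshold")
```

Because the singular values decay exponentially, shifting the threshold by an order of magnitude changes the count only modestly, consistent with the argument above.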

While the precise details of these observations should depend on the size and scope of the training set, it seems that they should hold generally for any finite-size training set. Singular value distributions for a training set consisting of 512 configurations were found to exhibit the same key features as those shown in Fig. 11 and can be found in the supplementary material.

### F. Effect of regularization

As we have mentioned, the choice of regularization parameter value is a nontrivial one owing to the absence of a natural gap in the singular values, which would suggest a natural threshold value. When this is the case, empirical testing of various regularization parameter values becomes essential. While *k*-fold cross-validation would represent a more thorough approach, we settle here for a simple comparison using a single training and test set, each consisting of 8192 configurations. For each polynomial of order *k* = 4 through 6, and each value of regularization parameter *α*, a total of 100 fits were attempted. The fit having the lowest RMSD with respect to the training set was selected and applied to the test set. The resulting RMSDs are summarized in Table III. In each case, the results suggest that the optimal choice of *α* should depend on the order of the polynomial, with *α* = 10^{−6}, 10^{−8}, and 10^{−10} resulting in the lowest test set RMSDs for *k* = 4, 5, and 6, respectively. With these choices of *α*, the polynomials show improved performance with respect to the test set, without any obvious sign of overfitting. While the need for regularization is perhaps not very compelling in this case, one could expect it to play a more impactful role in situations involving more complicated molecular species or when fitting two- and three-body terms, and also in scenarios where the training set is inadequate in some regard.
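The regularization scan summarized in Table III can be reproduced in spirit with a few lines of ridge-regularized least squares. The sketch below works from the SVD of a synthetic, ill-conditioned design matrix (not the actual sulfate data); the filter factors *s*/(*s*² + *α*) implement Tikhonov regularization with parameter *α*.

```python
import numpy as np

rng = np.random.default_rng(2)
scale = np.logspace(0, -6, 50)                    # ill-conditioned columns
A_train = rng.standard_normal((300, 50)) * scale
A_test = rng.standard_normal((300, 50)) * scale
c_true = rng.standard_normal(50)
E_train = A_train @ c_true + 1e-4 * rng.standard_normal(300)  # noisy energies
E_test = A_test @ c_true

U, s, Vt = np.linalg.svd(A_train, full_matrices=False)

def ridge_fit(alpha):
    # Tikhonov-filtered SVD solution: c = V diag(s / (s^2 + alpha)) U^T E
    return Vt.T @ ((s / (s ** 2 + alpha)) * (U.T @ E_train))

for alpha in (1e-2, 1e-6, 1e-10):
    c = ridge_fit(alpha)
    rmsd_tr = np.sqrt(np.mean((A_train @ c - E_train) ** 2))
    rmsd_te = np.sqrt(np.mean((A_test @ c - E_test) ** 2))
    print(f"alpha={alpha:.0e}  train RMSD={rmsd_tr:.3e}  test RMSD={rmsd_te:.3e}")
```

Smaller *α* always tracks the training set more closely; whether the test RMSD also improves depends on the noise level, mirroring the order-dependent optima seen in Table III.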

**TABLE III.** Training and test set RMSDs (kcal/mol) for regularized fits of order *k* = 4–6 as a function of the regularization parameter *α*.

| *α* | Training RMSD, *k* = 4 | Training RMSD, *k* = 5 | Training RMSD, *k* = 6 | Test RMSD, *k* = 4 | Test RMSD, *k* = 5 | Test RMSD, *k* = 6 |
|---|---|---|---|---|---|---|
| 10^{0} | 6.93152 | 4.95844 | 3.45844 | 6.78441 | 4.82219 | 3.38167 |
| 10^{−1} | 1.35379 | 0.95498 | 0.75599 | 1.33985 | 0.94711 | 0.75317 |
| 10^{−2} | 0.44265 | 0.34319 | 0.27694 | 0.43403 | 0.34018 | 0.27800 |
| 10^{−3} | 0.13119 | 0.09760 | 0.08534 | 0.12985 | 0.09688 | 0.08293 |
| 10^{−4} | 0.05407 | 0.03837 | 0.02255 | 0.05240 | 0.03859 | 0.02446 |
| 10^{−5} | 0.01079 | 0.00529 | 0.00426 | 0.01233 | 0.00599 | 0.00475 |
| 10^{−6} | 0.00408 | 0.00274 | 0.00190 | 0.00444 | 0.00344 | 0.00262 |
| 10^{−7} | 0.00398 | 0.00107 | 0.00092 | 0.00452 | 0.00148 | 0.00118 |
| 10^{−8} | 0.00397 | 0.00086 | 0.00073 | 0.00455 | 0.00103 | 0.00094 |
| 10^{−9} | 0.00397 | 0.00084 | 0.00063 | 0.00455 | 0.00106 | 0.00083 |
| 10^{−10} | 0.00397 | 0.00083 | 0.00060 | 0.00455 | 0.00105 | 0.00077 |
| 10^{−11} | 0.00397 | 0.00083 | 0.00060 | 0.00455 | 0.00104 | 0.00087 |
| 10^{−12} | 0.00397 | 0.00083 | 0.00059 | 0.00455 | 0.00106 | 0.00093 |
| 10^{−13} | 0.00397 | 0.00083 | 0.00059 | 0.00455 | 0.00106 | 0.00107 |
| 10^{−14} | 0.00397 | 0.00083 | 0.00059 | 0.00455 | 0.00106 | 0.00119 |
| 10^{−15} | 0.00397 | 0.00083 | 0.00059 | 0.00455 | 0.00106 | 0.00109 |


## IV. CONCLUSIONS

We have analyzed a number of features of the PIP-based PEF fitting problem, including the nature of the loss surface, impact of various training set attributes on the resulting fits, and important numerical features. While we have chosen sulfate as a case study, many of the observations can be reasonably generalized. In addition, we have introduced an approach to training set generation based on normal mode sampling. This sampling approach offers a large degree of control, allowing one to choose, for example, the underlying sequence to be used; to selectively sample particular modes (which may be desirable for large and/or rigid systems, depending upon the scientific questions to be addressed with the resulting PEF); temperature (effectively breadth of sampling); and use of classical or quantum distributions. At the same time, the approach still lends itself well to automation if one chooses a reasonable fixed temperature or range of temperatures as part of a standard protocol, or perhaps a system-dependent standard, e.g., one defined with respect to the eigenfrequencies of each system. An obvious limitation of this approach to generating training sets is that the number of possible minima, transition states, and saddles to be sampled about grows rapidly with the size of the system. This is, however, a fundamental challenge that any approach to the generation of training sets will have to contend with to some degree, either directly or indirectly. This approach to sampling can of course be applied to clusters, making it useful for generating portions of an *n*-body training set as well. 
Regarding the choice of underlying sequence to sample with, even for the relatively simple and low-dimensional case of sulfate, whose loss surface is characterized by one or two minima when using the training sets discussed in this work, the use of quasirandom (Sobol) sequences was found to have a noteworthy benefit, generally leading to more regular convergence to the optimum solution with respect to training set size. Benefits to the use of quasirandom sequences for training set generation are expected to be more pronounced for larger systems and warrant further investigation.

While the least-squares and Nelder–Mead algorithms themselves are highly robust and will effectively always locate *a* solution in practice (supposing an acceptable training set has been provided), the ability to reliably locate the *global* minimum is dependent on a number of factors and generally becomes increasingly difficult with increasing system size. Considering sulfate as a particular example, we have found the problem to be very tractable with respect to the complexity of the loss surface, and there is reason to expect that this tractability should also be the case more generally. At the same time, it seems clear that there is an inherent numerical limitation to the level of accuracy that can be achieved with the use of PIPs in their current form.

Independent of the numerical limitations that have been discussed in this work, the quasiexponential growth in the size of the PIP fitting basis and factorial growth in the number of permutations of like atoms renders the use of this approach to PEF fitting prohibitively expensive, if not impossible, for handling larger molecular species in its current state. An ideal approach would enable one to effectively decompose larger molecules into smaller fragments based on relevant structural, energetic, or other considerations and combine fits obtained for these fragments in an appropriate manner.^{71}

Follow-up work will entail further validation of the training set generation and fitting procedure outlined here for sulfate and other molecules, e.g., optimized geometries, harmonic frequencies, and performance of the fit on configurations obtained from MD simulations. In addition, features of the training and fitting procedure addressed here for the one-body fitting problem will be explored for the substantially more challenging two-body PIP-based PEF fitting problem.

## SUPPLEMENTARY MATERIAL

“Permutational assignments” for the systems and PEFs presented in Table I, RMSD surfaces for nitrate PIP-based PEFs, additional RMSD surfaces for sulfate PIP-based PEFs, and additional unscaled and scaled singular value distributions arising from fitting PIP-based PEFs for sulfate are given in the supplementary material.

## ACKNOWLEDGMENTS

We would like to thank Francesco Paesani, without whom this work would not have been possible, and Marc Riera and Andreas Götz for many helpful discussions. This research was supported by the Air Force Office of Scientific Research through Grant No. FA9550-16-1-0327, including the Department of Defense High Performance Computing Modernization Program. Calculations were performed on the Triton Shared Computing Cluster at the San Diego Supercomputer Center.

## REFERENCES

*Ab Initio* Molecular Potential Energy Surface Construction and Molecular Dynamics Simulation for Small Molecules,