Density functional theory has been widely used in quantum mechanical simulations, but the search for a universal exchange-correlation (XC) functional has been elusive. Over the last two decades, machine-learning techniques have been introduced to approximate the XC functional or potential, and recent advances in deep learning have renewed interest in this approach. In this article, we review early efforts to use machine learning to approximate the XC functional, with a focus on the challenge of transferring knowledge from small molecules to larger systems. Recently, the transferability problem has been addressed through the use of quasi-local density-based descriptors, which are rooted in the holographic electron density theorem. We also discuss recent developments using deep-learning techniques that target high-level *ab initio* molecular energy and electron density for training. These efforts can be unified under a general framework, which is also discussed in this perspective. Additionally, we explore the use of auxiliary machine-learning models for van der Waals interactions.

## I. INTRODUCTION

In 1964, Hohenberg and Kohn proved the unique mapping between the ground state electron density and the local potential, up to an overall constant.^{1} This insight led to the Kohn–Sham formulation of density functional theory and the notion of the exchange-correlation (XC) energy functional, introduced by Kohn and Sham in 1965.^{2} The Kohn–Sham approach provides a way to transform the many-electron problem into an equivalent one-electron problem with an effective potential. The search for the universal XC functional has since produced a variety of approximate XC functionals; however, the universal XC energy functional has remained elusive. Although a universal analytical form for the XC functional is believed to be impractical, the search for one remains an active area of research. For the state of the art of DFT, we refer the reader to Ref. 3.

Machine learning (ML) has been applied to construct the XC functional in DFT since 1996, when Tozer *et al.* proposed a machine learning approach to map the local electron density to the local XC potential.^{4} In 2004, Zheng *et al.* independently used a neural network to construct an improved XC energy functional based on the functional form of B3LYP.^{5} With the success of deep learning in computer vision,^{6} natural language processing,^{7} and other fields,^{8,9} there is growing interest in using deep learning architectures, such as convolutional neural networks (CNNs),^{10} graph neural networks (GNNs),^{11} and transformers,^{12} to approximate the universal XC functional.

Specifically, efforts to develop machine-learning-based (MLB) XC functionals or potentials can be categorized into several types: (i) MLB XC potentials,^{13,14} (ii) MLB XC energy functionals,^{5,15–27} (iii) MLB XC energy densities,^{16} and (iv) MLB XC energies of fragments.^{27–39} In addition to the XC functional or potential, other aspects of the DFT framework can also benefit from ML techniques; for instance, MLB kinetic energy functionals have been proposed.^{40–48} Furthermore, ML has been extensively used to fit or construct potential energy surfaces,^{49} where DFT is frequently used as a training target or benchmark for the ML algorithms. Besides the above-mentioned works on MLB XC functionals, researchers have employed data-driven techniques other than deep learning (such as genetic algorithms) to seek accurate forms of XC functionals.^{50,51} For example, in Ref. 50, the authors proposed a Symbolic Functional Evolutionary search to construct accurate XC functionals in symbolic form.

In this perspective, we focus on the construction of MLB XC functional or potential and review various methodologies, from the early approaches to the latest developments.^{52} It is important to note that our goal in this article is not to be exhaustive. Rather, we aim to explore specifically how machine learning techniques can be used to construct XC functionals or potentials and answer the question of whether the universal functional can be accurately obtained via deep learning.

## II. EXCHANGE-CORRELATION FUNCTIONAL AND POTENTIAL

The Hohenberg–Kohn theorem^{1} forms the basis for predicting the quantum mechanical properties of a many-electron system from its electron density, implying that the ground state energy is a unique functional of the electron density [denoted as *ρ*(**r**)]. By introducing a non-interacting reference system, Kohn and Sham^{2} expressed the ground state energy functional *E*[*ρ*(**r**)] as follows:

$$E[\rho] = T_s[\{\phi_i\}] + E_{\text{ext}}[\rho] + E_{\text{H}}[\rho] + E_{\text{xc}}[\rho].$$

Here, *T*_{s}, *E*_{ext}, *E*_{H}, and *E*_{xc} stand for the single-Slater-determinant kinetic energy with a set of orbitals {*ϕ*_{i}}, the external energy, the Hartree energy, and the exchange-correlation energy, respectively. The terms *T* and *E*_{ee} are the exact kinetic energy and the Coulomb energy for the many-electron interacting system, respectively, so that *E*_{xc} = (*T* − *T*_{s}) + (*E*_{ee} − *E*_{H}). Minimizing the total energy constrained by normalized orbitals leads to the following Kohn–Sham (KS) equations:

$$\left[-\frac{1}{2}\nabla^2 + v_{\text{ext}}(\mathbf{r}) + v_{\text{H}}(\mathbf{r}) + v_{\text{xc}}(\mathbf{r})\right]\phi_i(\mathbf{r}) = \varepsilon_i\,\phi_i(\mathbf{r}),$$

where *v*_{xc} is called the exchange-correlation potential, which is the functional derivative of the exchange-correlation energy with respect to the electron density, $v_{\text{xc}}(\mathbf{r}) = \delta E_{\text{xc}}[\rho]/\delta\rho(\mathbf{r})$. The effective potential depends on *ρ* (and hence on the orbitals *ϕ*_{i}), and thus the KS equations constitute a nonlinear eigenvalue problem. To solve this problem, an initial density *ρ*_{0} must be provided, and the solution must be updated until convergence is reached, a process known as the self-consistent field (SCF) calculation.^{53}

A key starting point for using ML techniques within DFT is to parameterize the XC energy functional (or potential, or even the corresponding energy density) using various ML architectures, such as neural networks,^{54} and to train the model with carefully designed descriptors as inputs and suitable training data. This is referred to as the ML-DFT method, and the ML architecture is termed the ML-DFT model in this perspective. The descriptors should be functions or functionals of the electron density. Below, we review the existing ML-DFT methodologies based on the different types of descriptors used in modeling.

## III. EARLY WORKS ON XC MODELS

Prior to the recent surge of research on constructing the XC functional or potential via ML, two research groups employed neural networks to search for the XC functional and potential, with the two pioneering publications^{4,5} appearing in 1996 and 2004, respectively. In both cases, the electron density was used as the descriptor, and the output was the XC potential or functional.

### A. Neural network-based B3LYP functional

The B3LYP functional^{55} includes five pure functional terms: (i) the Slater exchange functional $E_X^{\text{Slater}}[\rho]$;^{56} (ii) the Hartree–Fock exchange functional $E_X^{\text{HF}}[\rho]$;^{57} (iii) the difference between the Becke88 exchange^{58} and the Slater functionals, denoted as $\Delta E_X^{\text{Becke}}[\rho] = E_X^{\text{B88}}[\rho] - E_X^{\text{Slater}}[\rho]$; (iv) the Lee–Yang–Parr correlation functional $E_C^{\text{LYP}}$;^{55} and (v) the Vosko–Wilk–Nusair correlation functional $E_C^{\text{VWN}}$.^{59} The B3LYP functional is tuned by three coefficients, *a*_{0}, *a*_{X}, and *a*_{C}; it reads as follows:

$$E_{\text{xc}}^{\text{B3LYP}}[\rho] = a_0 E_X^{\text{Slater}}[\rho] + (1 - a_0) E_X^{\text{HF}}[\rho] + a_X \Delta E_X^{\text{Becke}}[\rho] + a_C E_C^{\text{LYP}}[\rho] + (1 - a_C) E_C^{\text{VWN}}[\rho].$$

In hybrid functionals like B3LYP, the coefficients are typically determined by fitting to experimental data or accurate calculations and, once obtained, are treated as constants. In B3LYP, the values are *a*_{0} = 0.8, *a*_{X} = 0.72, and *a*_{C} = 0.81, based on fitting a set of atomization energies and ionization potentials.^{58} See also Ref. 60 for the calibration and selection of hybrid density functionals using Bayesian optimization techniques.
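To make the role of the three coefficients concrete, the following sketch combines precomputed component energies into a hybrid XC energy under the convention used above; the function name and the numerical inputs are illustrative placeholders, not values from the literature.

```python
def hybrid_xc_energy(e_x_slater, e_x_hf, de_x_becke, e_c_lyp, e_c_vwn,
                     a0=0.8, ax=0.72, ac=0.81):
    """Combine B3LYP-style component energies with coefficients a0, ax, ac.

    a0 weights Slater vs. Hartree-Fock exchange, ax scales the Becke88
    gradient correction, and ac weights LYP vs. VWN correlation.
    """
    exchange = a0 * e_x_slater + (1.0 - a0) * e_x_hf + ax * de_x_becke
    correlation = ac * e_c_lyp + (1.0 - ac) * e_c_vwn
    return exchange + correlation

# Hypothetical component energies (hartree), for illustration only.
e_xc = hybrid_xc_energy(-1.0, -1.1, -0.05, -0.3, -0.25)
```

In the ML generalization discussed next, these constant coefficients become density-dependent outputs of a neural network.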

Zheng *et al.*^{5} proposed to project the exact XC functional onto the B3LYP form and pointed out that *a*_{0}, *a*_{X}, and *a*_{C} should, in theory, be system-dependent, i.e., functionals of the electron density. Making these coefficients functionals of the density, the exact XC functional can be expressed as

$$E_{\text{xc}}[\rho] = a_0[\rho]\, E_X^{\text{Slater}}[\rho] + (1 - a_0[\rho])\, E_X^{\text{HF}}[\rho] + a_X[\rho]\, \Delta E_X^{\text{Becke}}[\rho] + a_C[\rho]\, E_C^{\text{LYP}}[\rho] + (1 - a_C[\rho])\, E_C^{\text{VWN}}[\rho].$$

To approximate these density-dependent coefficients, Zheng *et al.*^{5} proposed a neural network with five descriptors as inputs and a single hidden layer. The outputs of the ML model are the coefficients *a*_{0}[*ρ*], *a*_{X}[*ρ*], and *a*_{C}[*ρ*], and the resulting XC functional is used in the KS-SCF calculations. Next, we briefly discuss the computation of the XC potential used in the SCF calculation. The exact XC potential reads as follows:

$$v_{\text{xc}}(\mathbf{r}) = \frac{\delta E_{\text{xc}}[\rho]}{\delta\rho(\mathbf{r})},$$

which, for the functional above, contains additional terms involving the functional derivatives of the coefficients *a*_{0}[*ρ*], *a*_{X}[*ρ*], and *a*_{C}[*ρ*]. In practice, these terms were neglected by assuming that the coefficients do not vary with *ρ* too much, that is, *δa*_{0}/*δρ* ≈ *δa*_{X}/*δρ* ≈ *δa*_{C}/*δρ* ≈ 0. With this scheme, the conventional B3LYP functional gives an overall RMS error of 4.7 kcal mol^{−1}, while the NN-based functional gives 2.9 kcal mol^{−1} (see Table I below). However, the resulting MLB XC functional is not exact due to the above approximation that the functional derivatives of *a*_{0}, *a*_{X}, and *a*_{C} are zero.

**TABLE I.** RMS errors (all data are in units of kcal mol^{−1}).

| Properties | AE | IP | PA | TAE | Overall |
|---|---|---|---|---|---|
| Number of samples | 56 | 42 | 8 | 10 | 116 |
| A^{a} | 2.9 | 3.9 | 1.9 | 4.1 | 3.4 |
| DFT-1^{b} | 3.0 | 4.9 | 1.6 | 10.3 | 4.7 |
| DFT-NN^{c} | 2.4 | 3.7 | 1.6 | 2.7 | 2.9 |

^{a} Becke's work. ^{b} Conventional B3LYP/6-311+G(3df, 2p). ^{c} Neural-network-based B3LYP/6-311+G(3df, 2p).

### B. Neural network-based XC potential model

In 1996, Tozer *et al.*^{4} proposed a neural network architecture that mapped the local electron density to the corresponding local XC potential. Because its only input is the density at a single point, the method is classified as *local descriptor-based*. In Ref. 4, the input densities were calculated with the Brueckner coupled cluster method,^{61} while the target XC potentials were computed by the Zhao–Morrison–Parr (ZMP) method;^{62} the model consisted of one fully connected layer with eight hidden neurons. Training was performed on (*ρ*, *v*_{xc}) pairs from either one molecule or multiple atoms/molecules. The ML-DFT model was used to perform KS-SCF calculations, resulting in significant improvements over LDA (see the CNN column in Table II for the numerical performance of the method). These improvements could be enhanced further by including information from the neighborhood of the local point, for instance, by adding first- and higher-order derivatives of the electron density to the descriptors, as pointed out by Tozer *et al.* in Ref. 4.

**TABLE II.** Results computed with LDA and the neural-network XC potential (CNN), compared with −*I*.

| System | LDA | CNN | −I |
|---|---|---|---|
| Ne^{a} | −0.492 | −0.660 | −0.792 |
| HF^{a} | −0.350 | −0.525 | −0.590 |
| N_{2}^{a} | −0.380 | −0.560 | −0.573 |
| H_{2}O^{a} | −0.261 | −0.441 | −0.463 |
| H_{2}^{a} | −0.369 | −0.550 | −0.567 |
| CO | −0.333 | −0.519 | −0.515 |
| F_{2} | −0.347 | −0.516 | −0.577 |
| CH_{4} | −0.346 | −0.535 | −0.460 |
| NH_{3} | −0.222 | −0.404 | −0.373 |
| C_{2}H_{2} | −0.270 | −0.461 | −0.419 |
| O_{3} | −0.293 | −0.468 | −0.457 |
| LiH | −0.159 | −0.422 | −0.283 |
| Li_{2} | −0.120 | −0.394 | −0.188 |

^{a} In the training set.
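As an illustration of such a single-hidden-layer local mapping, the sketch below wires a one-input, eight-neuron network of the kind described above; the weights are random placeholders rather than the trained values from Ref. 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# One fully connected hidden layer with eight neurons, one density input,
# one XC-potential output; random, untrained weights for illustration.
W1 = rng.normal(size=8)
b1 = rng.normal(size=8)
w2 = rng.normal(size=8)
b2 = float(rng.normal())

def vxc_local(rho):
    """Map a single local density value to a local XC potential value."""
    hidden = np.tanh(W1 * rho + b1)
    return float(w2 @ hidden + b2)

# Sweep the model point-by-point across a density grid.
densities = np.linspace(0.01, 1.0, 5)
vxc_values = [vxc_local(r) for r in densities]
```

Because the mapping sees only one point at a time, richer inputs (density gradients, etc.) must be appended to the descriptor to move beyond this local approximation.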

To summarize, the approach developed in Ref. 5 pioneered the research direction of constructing the XC functional using machine learning, while the approach developed in Ref. 4 is the first work targeting the XC potential directly. The method in Ref. 4 uses information from the local electron density as the descriptor, and we term it the local descriptor-based method;^{63} as such, it is necessarily an approximation. The numerical scheme constructed in Ref. 5 intends to use the entire electron density of a molecule as the descriptor, and we term it the global descriptor-based method, which can, in principle, be exact. In Sec. IV, we first review the recent studies on global descriptor-based methods utilizing more advanced machine-learning architectures.

## IV. RECENT WORKS ON GLOBAL DESCRIPTOR-BASED XC MODELS

### A. Deep neural networks for XC potential

The work by Nagai *et al.*^{13} investigated the idea of incorporating a neural-network-trained XC potential model in the KS-SCF calculation. Specifically, this approach uses a fixed grid of 100 consecutive, equally spaced points to feed the entire density as a vector into a fully connected neural network with two 300-neuron hidden layers, mapping the entire electron density to the target XC potential [see Fig. 1 (left) for the algorithmic procedure of the numerical scheme]. Once the XC potential model is trained, one can solve the Kohn–Sham equation, with the initial XC potential produced by the trained model from the initial electron density as the descriptor. The total energy can also be evaluated.

The proposed method was tested on a 1D model system consisting of two interacting spinless fermions with various random Gaussian external potentials. The target potential was set to be the total Hartree-plus-XC potential $v_{\text{Hxc}} = v_{\text{H}} + v_{\text{xc}}$, with *v*_{H} = −*A* exp(−*x*^{2}/*B*^{2}) being the Hartree potential with two parameters *A* and *B*; the corresponding density was calculated using exact diagonalization.

In Fig. 1 (right), the two columns show (as color maps) the out-of-training error in density and total energy derived from the KS scheme with the trained potentials. The horizontal and vertical axes represent the ranges of the parameters *A* and *B*, respectively. Overall, the trained neural network model demonstrated good generalizability in out-of-sample tests with unseen external potentials within the simple setup.
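A minimal sketch of such a global mapping, with the 100-point input grid and two 300-neuron hidden layers described above; the weights are random, untrained placeholders, not the parameters of Ref. 13.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_layer(n_in, n_out):
    # Small random weights standing in for trained parameters.
    return rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out)

# 100 grid values in, two 300-neuron hidden layers, 100 potential values out.
layers = [make_layer(100, 300), make_layer(300, 300), make_layer(300, 100)]

def vxc_global(rho_grid):
    """Map an entire 100-point density vector to the potential vector."""
    x = np.asarray(rho_grid, dtype=float)
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:      # nonlinearity on hidden layers only
            x = np.tanh(x)
    return x

rho = np.exp(-np.linspace(-5.0, 5.0, 100) ** 2)  # a Gaussian trial density
v_hxc = vxc_global(rho)
```

In an SCF loop, the output vector would replace the conventional XC potential on the same grid at every iteration.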

### B. Projection-based XC potential and energy model

Brockherde *et al.*^{28} employed a machine learning method to predict DFT or CCSD energies (or the correction to a standard DFT calculation) from DFT densities. In principle, they utilized a periodic Fourier basis set comprising 12 500 functions to perform a projection and represent each molecular density as follows:

$$\rho(\mathbf{r}) = \sum_{\ell=1}^{L} u^{(\ell)}[v]\,\phi_\ell(\mathbf{r}),$$

where *u*^{(ℓ)} and *ϕ*_{ℓ} are the *ℓ*th projection coefficient and the *ℓ*th basis function, respectively. The term *v* denotes the external nuclear potential, which was approximated by a sum of Gaussians as in Ref. 28. The projection coefficient vector $\mathbf{u} = (u^{(\ell)})_{\ell=1}^{L}$ is then mapped to the target energy through a kernel ridge regression (KRR) model.^{64} See Fig. 2(a) for its algorithmic procedure. The target energy was selected to be either the DFT energy obtained using the PBE functional, the CCSD(T) energy, or the difference between the two, which captures the exchange-correlation contribution at varying levels of accuracy.

Specifically, the KRR prediction takes the form

$$E_{\text{ML}}[v] = \sum_{i=1}^{N} \alpha_i\, k(\mathbf{u}[v], \mathbf{u}[v_i]), \qquad (8)$$

where *k*(·, ·) is a Gaussian kernel, $k(\mathbf{u}, \mathbf{u}') = \exp\!\big(-\|\mathbf{u} - \mathbf{u}'\|^2 / (2\sigma^2)\big)$, measuring the similarity between any two projected density descriptor vectors, with *σ* being a hyper-parameter determined by cross-validation.^{65} In Eq. (8), *E*_{ML} stands for the fitted energy; **u**[*v*] denotes the projection coefficient vector for the external potential *v*; and *v*_{i} is the *i*th external potential in the training set. Predictions for new densities are generated by a summation of the learned weights *α*_{i} multiplied by the kernel function, effectively representing unknown densities by interpolating known densities, with the interpolation parameters learned by the model.

To broaden the scope of their approach, the authors also built a separate KRR model mapping the external potential to the density. By combining this with the model that maps the density to the energy, all the density functionals can be expressed as functionals of the external potential. This effectively blurs the line between machine learning methods based on density functional theory and those that learn directly from the molecular geometry.
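A minimal sketch of fitting and applying such a KRR model on toy descriptor vectors; the synthetic data, kernel width, and regularization below are illustrative, not the settings of Ref. 28.

```python
import numpy as np

def gaussian_kernel(U1, U2, sigma):
    """Pairwise Gaussian kernel between the rows of U1 and the rows of U2."""
    d2 = ((U1[:, None, :] - U2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(U_train, E_train, sigma, lam=1e-10):
    """Solve (K + lam*I) alpha = E for the KRR weights alpha."""
    K = gaussian_kernel(U_train, U_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(E_train)), E_train)

def krr_predict(U_new, U_train, alpha, sigma):
    """E_ML = sum_i alpha_i * k(u, u_i), as in Eq. (8)."""
    return gaussian_kernel(U_new, U_train, sigma) @ alpha

# Toy training set: 20 descriptor vectors (length 5) with synthetic energies.
rng = np.random.default_rng(2)
U = rng.normal(size=(20, 5))
E = U.sum(axis=1)                  # a simple synthetic energy target
alpha = krr_fit(U, E, sigma=1.0)
E_pred = krr_predict(U, U, alpha, sigma=1.0)
```

In the actual method, the rows of `U` would be the Fourier projection coefficients of the densities and `E` the PBE or CCSD(T) energies.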

### C. Kohn–Sham regularizer

Previous efforts constructed XC potential models for SCF calculations; however, the training procedure and the SCF calculations were independent of each other. In contrast, in the work by Li *et al.*,^{21} the ML model was programmed in a fully differentiable way with the aid of automatic differentiation,^{66–70} allowing errors to backpropagate through multiple iterations of the SCF calculation. In general, automatic differentiation allows one to efficiently compute the derivatives of any function in a computer program, and this technique can be used, for instance, to minimize the Hartree–Fock energy (or any other objective functional) to avoid eigenvalue calculations in an orbital-free setting.^{68}

The scheme developed by Li *et al.*^{21} effectively included more information about the functional mapping from the density to the XC energy, and it was named the *Kohn–Sham regularizer* (KSR) due to its generalization (overfitting-preventing) capability (see also Ref. 71 for a spin-adapted version of the KSR model). Figure 3(a) depicts the computational procedure of the KSR model, which uses the electron density of the molecule as the model input. The model unrolls a fixed number *K* of SCF iterations [see Fig. 3(b) for the internal process of one SCF iteration], each parameterized by a neural network whose architecture is sketched in Fig. 3(c), outputting a series of energies $\{E_k\}_{k=1}^{K}$ and densities $\{\rho_k\}_{k=1}^{K}$. The loss function includes both an energy and a density term: all the energies *E*_{k} contribute to the energy loss, with decaying weights for earlier iterations, while only the last density *ρ*_{K} enters the density loss, as the root mean squared error between the final iteration's output and the target.
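The two-part loss described above can be sketched as follows; the decay factor, trajectory, and targets are illustrative placeholders, not the settings of Ref. 21.

```python
import numpy as np

def ksr_loss(energies, densities, e_target, rho_target, decay=0.9):
    """Energy loss over all K unrolled SCF iterates, with earlier iterations
    down-weighted by `decay`, plus an RMSE density loss on the last iterate."""
    K = len(energies)
    weights = np.array([decay ** (K - 1 - k) for k in range(K)])  # w_K = 1
    weights /= weights.sum()
    energy_loss = float(np.sum(weights * (np.asarray(energies) - e_target) ** 2))
    density_loss = float(np.sqrt(np.mean(
        (np.asarray(densities[-1]) - np.asarray(rho_target)) ** 2)))
    return energy_loss + density_loss

# Toy trajectory: energies and densities approaching the targets over K = 4.
es = [-0.90, -0.98, -1.00, -1.00]
rhos = [np.full(10, 0.25), np.full(10, 0.22), np.full(10, 0.20), np.full(10, 0.20)]
loss = ksr_loss(es, rhos, e_target=-1.0, rho_target=np.full(10, 0.20))
```

Because every unrolled iterate contributes to the energy term, gradients flow through the whole SCF trajectory rather than only its fixed point.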

The KSR model has shown generalizability for a simple one-dimensional H_{2} model system, with only two training examples needed to determine the whole dissociation curve reasonably well. However, as the work was developed for 1D model systems, it still falls under the category of *proof-of-concept*. Moreover, the energy loss term contains contributions from all the energies *E*_{k} produced along the SCF iterations, and this training mechanism enforces that the model's energy should converge in more or less exactly the way the training labels did, which is generally not practical in conventional SCF calculations. Extending the approach to realistic 3D systems will require extra effort due to the computational complexity.

## V. MODEL TRANSFERABILITY AND HOLOGRAPHIC ELECTRON DENSITY THEOREM

Although the application of high-precision quantum chemistry methods, such as CCSD^{72} and quantum Monte Carlo,^{73,74} facilitates the acquisition of large amounts of data on small molecules, obtaining such accurate datasets for large molecules from *ab initio* methods is not practical. The lack of such data for larger molecules poses a key problem for machine-learning-based XC functionals of complex molecules: since most of the existing ML-DFT models are trained only on datasets of small molecules, the models' transferability from simple, small molecules to complicated, large ones is a central challenge in constructing a universal XC functional. To address this issue, the density descriptors must be carefully designed to ensure the transferability of the ML-DFT model from small molecules to large ones.

Riess and Münch^{75} posited in 1981 that the electron density distribution of a molecular system is determined by an arbitrary finite volume of the ground state electron density, based on the hypothesis that electron density functions of atomic and molecular species are real analytic in real space excluding the nuclei. The validity of this hypothesis, however, was not rigorously proven until Fournais *et al.* demonstrated the real analyticity of the electron density of an arbitrary atomic or molecular eigenstate of the Schrödinger equation.^{76,77} Another proof of the real analyticity of the electron density has been given by Jecko.^{78} The ground state holographic electron density theorem (GS-HEDT), so named by Mezey,^{79} is thought to be linked to the concept of quantum similarity measures in DFT.^{80,81}

For an atomic or molecular system, the external potential *v*(**r**) acting on each electron is real analytic except at the nuclei. The electron density is real analytic everywhere except at the isolated points where the nuclei's point charges cause non-analyticities. Analytic functions, such as Gaussian orbitals and plane waves, are often used as basis sets for quantum mechanical calculations, resulting in real analytic electron densities. The values within a subregion are then sufficient to determine the values everywhere in physical space, which can be shown by the analytic continuation of real analytic functions, as demonstrated in Ref. 82. Zheng *et al.* have provided a simple proof of the holographic property of real analytic densities in three-dimensional physical space and proposed the time-dependent holographic electron density theorem for open electronic systems, which has been applied to the study of time-dependent quantum transport problems.^{83–86} Moreover, the *nearsightedness principle* proposed by Kohn^{87} (see also Ref. 88) suggests that local electronic properties, such as the electron density, depend mostly on the external potential in nearby regions. This principle shares the same foundation as the GS-HEDT, which also highlights the local nature of ground state electrons.

In light of the GS-HEDT,^{89} it may be possible to achieve an accurate quasi-local KS mapping through the use of advanced machine learning techniques,

$$v_{\text{ML-XC}}(\mathbf{r}) = M_\theta\big[\rho(\mathbf{r}');\ \mathbf{r}' \in B(\mathbf{r})\big],$$

where *M*_{θ} denotes the ML-DFT model with its optimized parameters denoted as *θ*, and *B*(**r**) denotes a neighborhood of **r**. The ML XC potential *v*_{ML-XC}(**r**) thus depends on the electron density *ρ* at **r** and in its neighborhood. After training, the resulting ML-DFT model for *v*_{ML-XC}(**r**) can be used in SCF calculations. As dictated by the GS-HEDT, the neighborhood could, in principle, be arbitrarily small. In practice, however, the quasi-local region surrounding the spatial point should be of a certain finite size to ensure a numerically feasible KS mapping. In Ref. 14, a cube centered at each position, with sampling points arranged along the spatial directions, is a viable neighborhood choice. For instance, for a given window half-length *h* > 0, the sampling points range over the cube [*r*_{x} − *h*, *r*_{x} + *h*] × [*r*_{y} − *h*, *r*_{y} + *h*] × [*r*_{z} − *h*, *r*_{z} + *h*] centered at **r** = (*r*_{x}, *r*_{y}, *r*_{z}), with a certain step length (the smaller the step, the more points are sampled for a given *h*). The output of the ML-DFT model is the value of the XC potential at **r**; therefore, once trained, the model predicts the XC potential at the center of the sampling neighborhood. The entire XC potential is obtained by sweeping the model across the grid, and the output is used in the KS equation within the SCF procedure to calculate a new density.
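A minimal sketch of assembling such a cubic quasi-local descriptor from a density sampled on a uniform grid; the array sizes and the Gaussian toy density are illustrative, not the settings of Ref. 14.

```python
import numpy as np

def quasi_local_cube(rho, center, half_width):
    """Extract the (2h+1)^3 block of grid values centered at `center`.

    `rho` holds the density on a uniform 3D grid, `center` is an index
    triple, and `half_width` is the window half-length h in grid points.
    """
    ix, iy, iz = center
    h = half_width
    return rho[ix - h:ix + h + 1, iy - h:iy + h + 1, iz - h:iz + h + 1]

# Toy density on a 16^3 grid; descriptor for the point at index (8, 8, 8).
grid = np.linspace(-3.0, 3.0, 16)
X, Y, Z = np.meshgrid(grid, grid, grid, indexing="ij")
rho = np.exp(-(X ** 2 + Y ** 2 + Z ** 2))
cube = quasi_local_cube(rho, center=(8, 8, 8), half_width=3)
# cube (7 x 7 x 7 = 343 values) is the kind of block a 3D CNN would consume.
```

Sweeping `center` over all grid points, and feeding each cube to the trained model, reconstructs the XC potential on the whole grid.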

The above ML-DFT model, which uses the quasi-local electron density as the descriptor, is termed the quasi-local descriptor-based XC model. In the next section, we review three different types of quasi-local descriptor-based ML-DFT models.

## VI. QUASI-LOCAL DESCRIPTOR-BASED XC MODELS

Compared to a local descriptor-based ML-DFT model such as the one proposed in Ref. 4, the quasi-local descriptor-based model can, in principle, be exact and, in practice, is substantially more accurate. This is justified by the HEDT, which states that the ground state electron density within a subdomain uniquely determines the ground state properties of both the subdomain and the total domain of the system. Quasi-local descriptor-based ML-DFT methods are therefore promising.

### A. Quasi-local XC potential model

In Ref. 14, Zhou *et al.* established the rigorous foundation of the quasi-local descriptor-based ML-DFT method and, in addition, developed and implemented its ML-DFT model and the subsequent KS-SCF algorithm. Quasi-local densities (inputs or descriptors) and XC potentials (labeled data) were discretized on a grid whose points coincide with the set of quadrature points for potential integration. A convolutional neural network (CNN)^{90} architecture was employed, with the input being a cube of sampled density; the final output of the model is a scalar value of the XC potential at the respective quadrature point. The resulting ML XC potential is then integrated and used for subsequent SCF calculations.

The ML-DFT model is a 3D CNN, as depicted in Fig. 4. It was trained on a dataset of 50 H_{2} molecules and 50 HeH^{+} ions (with bond lengths ranging from 0.504 to 0.896 Å) and tested on H_{2} and HeH^{+}. The ground state electron density, used as the input or descriptor, was calculated by employing CCSD(T). The target or output is the XC potential, which was calculated using the Wu–Yang method^{91,92} (see Appendix B for a brief introduction).

This ML-DFT model outperforms traditional DFT using B3LYP in terms of electron density accuracy by at least one order of magnitude, as demonstrated by benchmarking against the reference CCSD electron density. When integrated into the SCF procedure, the ML XC potential achieves impressive performance on the electron density, surpassing B3LYP by up to two orders of magnitude. In Fig. 5(a), the HeH^{+} electron density calculated with the ML-DFT method is compared with B3LYP, with the CCSD(T) electron density as the reference. With the predicted electron density, atomic forces can be calculated using the Hellmann–Feynman theorem^{93} and a basis set correction,^{94} and the accuracy is significantly better than that of B3LYP.

Figure 5(b) shows that the same model was tested on HeH^{+} ions with He–H distances up to values much larger than those in the training set. The model’s out-of-sample performance, as measured by the density difference to CCSD, remained much smaller than that of B3LYP even at bond distances around 3 Å for HeH^{+}. Furthermore, the density performance of the ML-DFT model outperformed that of B3LYP even in more complex systems (such as He–H–H–He^{2+}) with different numbers of electrons and nuclei than molecules in the training set, and Figs. 5(c) and 5(d) show the comparison of two different structures, respectively. The use of quasi-local electron density as input has yielded exceptional transferability of the ML-DFT model.

### B. Quasi-local XC energy density model

An alternative target for an ML-DFT model is the XC energy density *ɛ*_{xc}, defined as follows:

$$E_{\text{xc}}[\rho] = \int \rho(\mathbf{r})\, \varepsilon_{\text{xc}}([\rho]; \mathbf{r})\, d\mathbf{r}.$$

Although targeting the energy density shares similarities with the previous ML-DFT models for the XC potential, the model output is different, and careful consideration is required. Like the previous ML-DFT models, this model requires training data in the form of the XC potential or electron density at each grid point, and a sensible strategy is to supply training targets over the entire grid. Unfortunately, unlike the XC potential, there is no procedure analogous to the WY method^{91,92} to produce the energy density. Furthermore, computing the XC potential from a parameterized energy density requires second-order derivatives, which can be computationally intensive. Nevertheless, automatic differentiation techniques and packages are now available to handle such calculations; implementing the model involves saving the first-derivative graph and carrying the extra numerical burden of backpropagation for the second-order derivatives. The XC energy and potential can then be obtained via numerical manipulation of the XC energy density: the total XC energy is generated by integrating the XC energy density weighted by the electron density.
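Concretely, on a quadrature grid with weights w_i, the last step amounts to E_xc ≈ Σ_i w_i ρ(**r**_i) ɛ_xc(**r**_i). A minimal numerical sketch on a toy 1D grid, with ɛ_xc set to a constant purely so the result can be checked against a known integral:

```python
import numpy as np

def total_xc_energy(weights, rho, eps_xc):
    """Integrate the XC energy density weighted by the electron density:
    E_xc ~ sum_i w_i * rho_i * eps_xc_i over the quadrature grid."""
    return float(np.sum(np.asarray(weights) * np.asarray(rho) * np.asarray(eps_xc)))

# Toy check on a uniform 1D grid: rho = exp(-x^2) and eps_xc = -1 everywhere,
# so E_xc should approach minus the integral of rho, i.e. -sqrt(pi).
x = np.linspace(-8.0, 8.0, 4001)
w = np.full_like(x, x[1] - x[0])   # uniform quadrature weights
rho = np.exp(-x ** 2)
e_xc = total_xc_energy(w, rho, -np.ones_like(x))
```

In a real 3D calculation, the weights would come from the molecular quadrature grid and `eps_xc` from the trained energy density model.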

Such a model was developed by Nagai *et al.*,^{24} employing a fully connected neural network trained with different electron density descriptors as inputs and the XC energy density as the output. However, the XC energy density is not directly used in the training loss function (only losses in the total energy and electron density are employed). The electron density descriptors used in the model include various combinations of the following density-related quantities: the spin densities, the density gradient, the kinetic energy density, and a coarse-grained density averaged over a region around the point. Denoting by **g** the overall input vector concatenating all necessary input descriptors, the XC energy density is parameterized as *ɛ*_{xc}(**r**) = *ɛ*_{xc}^{NN}(**g**(**r**)). The coarse-grained descriptor gives rise to the *near region approximation* (NRA) in DFT. Depending on which terms are included, the formulation unifies various levels of detail about the local or quasi-local electron density; if all five descriptors are included, the ML-DFT model is referred to as an NRA-type functional. To compute the XC potential from the XC energy density, a Monte Carlo method was used instead of backpropagation, avoiding complications from both the backpropagation through the inverse KS problem and the second-order derivative problem.

The resulting ML XC energy density model with local density descriptors shows a reasonable performance (see Fig. 4 in Ref. 24). However, the performance only becomes comparable to traditional hybrid functionals when the coarse-grained quasi-local density is included through the fifth descriptor (the NRA shown in the original paper). CCSD(T) and G2^{95} results are used as the reference data.

### C. XC fragment energy model

The HEDT guarantees the representability of the XC potential and XC energy (or energy density) by the quasi-local density. This one-to-one mapping between the local XC potential and the quasi-local electron density can be utilized in several different ways. A slightly different approach from the previous models is to divide the XC energy into contributions from physically meaningful fragments (e.g., *atoms*).

As shown in Fig. 6, the electron density of a system is divided into four fragments, each with a unique mapping to the system's properties. When the mapping *ρ*_{frag,i} ↦ *E*_{i} is specified for each fragment *i* ∈ {1, 2, 3, 4}, with the fragment contributions summing to the total XC energy, *E*_{xc} = *∑*_{i}*E*_{i}, it uniquely determines a quasi-local XC functional *E*_{i} = *E*_{xc}[*ρ*_{frag,i}]. This mapping is relatively straightforward to find with atomic division: the total XC energy of a molecule can be equated to the sum of XC energy contributions from its constituent atoms, and a machine learning model can read the quasi-local densities around each nucleus and output the corresponding atomic XC energy contributions.

It should be noted that even though the XC energy is expressed as a sum of single-atom contributions, higher-order interactions among two or more atoms can still be partitioned into single-atom contributions, because the quasi-local density around each nucleus contains information from all orders. However, it is up to the machine learning model to determine how the energy contribution is split among the participating atoms. For instance, for a C=O bond in a specific environment, the XC energy correction attributable to the bond can be apportioned to both the carbon and oxygen atoms.

Atomic contributions to molecular potential energy surfaces (PES) have been constructed prior to the widespread use of deep learning models, as demonstrated in the work of Behler and Parrinello.^{49} However, to construct a truly universal XC functional that requires no additional information beyond the density itself, higher complexity models are necessary. Every aspect of the XC energy or potential arises from the subtle variations in the shape of the quasi-local density. Recent advancements in deep learning have made it possible to construct such models.

Dick *et al.*^{16} successfully demonstrated promising accuracy on small molecules using an XC energy fragment model based on atomic contributions. The model constructs a specific neural network for each atom type and samples the electron density surrounding each nucleus using Gaussian-orbital-like projectors. Symmetrized projected values serve as the input for the neural networks, whose outputs represent the energy contribution from each atom. The total XC energy is calculated by summing the outputs of all atomic neural networks. For the SCF calculation, functional derivatives with respect to the density are needed. Figure 7 depicts the architecture of the ML-DFT model. Noticeably, the derivatives assume a rather simple transformation from density descriptors to the density itself,

$$\frac{\delta E_{\mathrm{xc}}}{\delta \rho(\mathbf{r})} = \sum_{\beta} \frac{\partial E_{\mathrm{xc}}}{\partial c_{\beta}}\, \psi_{\beta}(\mathbf{r}),$$

where *β* is the index for different projectors, *c*_{β} is the projected value of the density on the projector, and *ψ*_{β}(**r**) is the shape of the projector.
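The projection and the chain-rule derivative can be illustrated numerically. In the sketch below, the 1D grid, Gaussian widths, and quadratic toy energy model are assumptions standing in for the real three-dimensional descriptors and the neural network.

```python
import numpy as np

# 1D grid standing in for real space (illustrative).
r = np.linspace(-5.0, 5.0, 201)
dr = r[1] - r[0]

# Gaussian-orbital-like projectors psi_beta centered at the nucleus
# (widths are assumed values).
widths = np.array([0.5, 1.0, 2.0])
psi = np.exp(-r[None, :] ** 2 / (2 * widths[:, None] ** 2))  # (3, 201)

rho = np.exp(-np.abs(r))  # toy electron density

# Projected descriptors: c_beta = \int psi_beta(r) rho(r) dr
c = psi @ rho * dr

# Toy energy model E_xc = sum(c^2) with known gradient dE/dc
# (stands in for the trained neural network).
dE_dc = 2.0 * c

# Functional derivative: dE_xc/drho(r) = sum_beta (dE/dc_beta) psi_beta(r)
v_from_descriptors = dE_dc @ psi  # (201,)
```

The functional derivative is thus a fixed linear combination of projector shapes, weighted by the network's gradients with respect to the descriptors.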

While the model developed by Dick *et al.* has shown promising accuracy for small molecules, it is not yet universal. The model relies on different neural networks and projectors for each type of atom, and different models were trained for different datasets. Specifically, the researchers developed three distinct models for three different datasets.

## VII. IMPROVING ML DFT MODEL PERFORMANCE

In this section, we review existing approaches to improve the performance of ML DFT models. These include the use of different training strategies, the design of specific loss functions, and the imposition of physical constraints that density functionals should satisfy. By implementing these methods, one can improve the accuracy of ML-DFT calculations and enhance the transferability of the resulting models.

### A. Fully differentiable training with SCF calculations

To build an ML-DFT model that accurately represents the universal XC functional, the trained model with a fixed set of parameters should be applicable to any atoms, molecules, and materials. However, optimizing the parameters during the training phase can be highly complicated due to the tangled relationship between the ML model and the SCF calculations: the parameters should be optimized in a way that helps the SCF procedure converge to the correct density, and because the same model is invoked during each SCF iteration, one cannot simply isolate the SCF procedure from the model training. This problem has been solved by implementing the KS equation with differentiable programming,^{96–98} an emerging programming paradigm that allows one to take the derivative of the output of an arbitrary code snippet with respect to its input using automatic differentiation techniques.^{66}

One can embed the SCF calculation within the optimization procedure to better train an ML-DFT model. This idea was first demonstrated in a simple 1D system by Li *et al.*^{21} Later, Kasim and Vinko^{19} and Dick and Fernandez-Serra^{99} implemented neural network models for three-dimensional molecules, where the derivatives are computed by backpropagating through the SCF iterations. However, this approach requires a large amount of memory and may suffer from numerical instability when computing the derivatives, which makes it difficult to train on large datasets. The technique of implicit differentiation^{69} can be applied to reduce the computational complexity and memory footprint of the actual implementation.
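The contrast between backpropagating through unrolled SCF iterations and implicit differentiation can be illustrated on a toy scalar fixed-point problem; the map `g` below is a stand-in for a real KS-SCF step, which would rebuild the Fock matrix from an ML XC potential and rediagonalize.

```python
import torch

# Toy "SCF map": rho_new = g(rho; theta), a stand-in for one KS-SCF step.
def g(rho, theta):
    return torch.tanh(theta * rho) + 0.5

theta = torch.tensor(0.3, requires_grad=True)

# (a) Backpropagation through the unrolled, damped SCF iterations.
rho = torch.tensor(0.1)
for _ in range(200):
    rho = 0.5 * rho + 0.5 * g(rho, theta)
grad_unrolled, = torch.autograd.grad(rho, theta, retain_graph=True)

# (b) Implicit differentiation at the converged fixed point rho*:
#     d rho*/d theta = (dg/dtheta) / (1 - dg/drho)
rho_star = rho.detach().requires_grad_(True)
out = g(rho_star, theta)
dg_drho, = torch.autograd.grad(out, rho_star, retain_graph=True)
dg_dth, = torch.autograd.grad(out, theta)
grad_implicit = dg_dth / (1.0 - dg_drho)
```

Both routes give the same gradient, but (b) needs only the converged point rather than the memory of every iteration, which is the motivation for implicit differentiation in ML-DFT training.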

### B. Designing loss functions

A commonly used loss function for the density takes the form

$$L_{\rho} = \mathbb{E}_{\mathrm{train}}\!\left[\, \big\| \rho_{\text{ML-KS}} - \rho_{\text{target}} \big\|^{2} \right],$$

where *ρ*_{ML-KS} is the electron density after the KS-SCF calculation with the ML-DFT model for the XC functional or potential, *ρ*_{target} is the target or reference electron density, and $\mathbb{E}_{\mathrm{train}}[\cdot]$ indicates the averaging operation over the training set. When the model outputs the XC energy *E*_{xc} (or the XC energy density *ɛ*_{xc}), in addition to reproducing the electron density, a loss function in energy,

$$L_{E} = \mathbb{E}_{\mathrm{train}}\!\left[\, \big( E_{\text{ML}} - E_{\text{target}} \big)^{2} \right],$$

can be included as well.

To construct an accurate ML-DFT model, it is important that the model reproduces not only the target energy but also the target electron density, which is often obtained from expensive *ab initio* methods. Gradient descent, or one of its variants, is commonly used for optimization during training, and automatic differentiation during backpropagation allows effective computation of the gradient with respect to the model parameters. If the density loss is included and the model is coupled with the KS equations, backpropagation requires the inverse eigenvalue problem in the KS equations to be solved before the parameters can be updated, which calls for dedicated numerical techniques. Alternatively, reproducing the target density can be enforced by using the potential loss only, as shown in Ref. 14.
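A combined loss of this kind might be sketched as follows; the quadrature weighting, squared-error forms, and relative weights are illustrative choices rather than those of any specific paper.

```python
import torch

def ml_dft_loss(rho_ml_ks, rho_target, e_ml, e_target, weights,
                w_rho=1.0, w_e=1.0):
    """Combined training loss: density loss plus energy loss.

    The densities live on a quadrature grid with integration weights, so the
    weighted squared-difference term approximates an integral of |drho|^2.
    """
    density_loss = torch.sum(weights * (rho_ml_ks - rho_target) ** 2)
    energy_loss = (e_ml - e_target) ** 2
    return w_rho * density_loss + w_e * energy_loss

grid = torch.linspace(-4, 4, 101)
w = torch.full_like(grid, (grid[1] - grid[0]).item())  # uniform weights
rho_ref = torch.exp(-grid ** 2)
rho_model = rho_ref + 0.01 * torch.sin(grid)  # stand-in for the ML-KS density
loss = ml_dft_loss(rho_model, rho_ref,
                   torch.tensor(-1.02), torch.tensor(-1.0), w)
```

The relative weights `w_rho` and `w_e` trade off density fidelity against energy fidelity and would in practice be tuned on a validation set.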

### C. Physical constraints for ML-DFT models

Although ML techniques have been widely employed in the search for the exact form of the universal XC functional, these MLB XC functionals are often treated as *black boxes* and may not satisfy the physical constraints that the XC functional should obey in principle. For instance, the exchange-energy density of any finite many-electron system satisfies an exact 1/*r* asymptotic behavior.^{58} This theoretical insight may be useful when designing parameterizations of new MLB XC functionals. Moreover, other physical constraints, such as spin-scaling^{100} for the exchange energy and the Lieb–Oxford bound^{101} for the exchange-correlation energy, are derived from fundamental principles of DFT and, thus, can also be used to guide the ML modeling.

Recent efforts have been made to address this issue by designing MLB XC functionals that satisfy certain physical constraints, integrating ML modeling with exact-constraint satisfaction.^{102,103} This approach has shown promise in producing ML-constructed XC functionals that respect physical constraints and exhibit improved transferability and accuracy over traditional approximations.
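One way to build in such a constraint is to bound the model's output by construction rather than hoping training discovers the bound. The sketch below caps an exchange enhancement factor at the value often quoted from the Lieb–Oxford bound; the sigmoid squashing is an illustrative design choice, not the scheme of Refs. 102 and 103.

```python
import numpy as np

C_X = (3.0 / 4.0) * (3.0 / np.pi) ** (1.0 / 3.0)  # LDA exchange constant
F_MAX = 1.804  # exchange enhancement cap often quoted from the Lieb-Oxford bound

def constrained_exchange_density(rho, raw_nn_output):
    """Exchange energy density with a built-in bound 0 < F < F_MAX.

    raw_nn_output is the unbounded output of a hypothetical ML model; the
    sigmoid squashing guarantees the enhancement factor respects the cap
    by construction.
    """
    F = F_MAX / (1.0 + np.exp(-raw_nn_output))   # in (0, F_MAX)
    e_x_lda = -C_X * rho ** (4.0 / 3.0)          # LDA exchange energy density
    return F * e_x_lda

rho = np.array([0.1, 1.0, 5.0])
raw = np.array([-2.0, 0.0, 10.0])  # arbitrary network outputs
e_x = constrained_exchange_density(rho, raw)
```

Spin-scaling can be enforced in the same spirit, by evaluating the constrained functional on each spin channel and combining the results according to the exact scaling relation.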

## VIII. OUTLOOK

### A. General quasi-local descriptor formalism

The quasi-local electron density, which contains enough intrinsic information about the molecular system as dictated by the HEDT, is clearly a more suitable descriptor for training a better ML-DFT model than either the local electron density or the global one. With quasi-local electron density descriptors, one can parameterize the mapping from electron density to XC quantities with sufficiently many features to capture its details. Once the electron density is given, the XC quantities are uniquely determined.

The general workflow of a quasi-local ML-DFT model is depicted in Fig. 8. The quasi-local electron density distribution *ρ*_{in}(**r**; **r**_{0}) around **r**_{0} serves as the descriptor input to the ML-DFT model, which outputs the intermediate XC potential *v*_{xc}(**r**_{0}) or XC energy density *ɛ*_{xc}(**r**_{0}) at **r**_{0} for use in the subsequent KS solver. The input electron density may be obtained by CCSD, the quantum Monte Carlo method, or other high-precision quantum chemistry methods. After the KS solver, a new charge density *ρ*_{new} and other physical properties such as the total energy are obtained; these can be used to form the loss function for training the ML model, by comparison with the high-precision electron density and/or other quantities such as the high-precision energy. Once the training is complete, the resulting ML-DFT model can be employed in SCF calculations to compute highly accurate physical properties, such as the electron density and total energy of the system.

Alternatively, *ρ*_{in} can be calculated with conventional DFT methods, such as B3LYP, since the B3LYP version of *ρ*_{in} has a one-to-one correspondence with its higher-precision counterpart (for instance, from CCSD). The advantage is that, once the ML-DFT model is built, no SCF calculation is required to compute the molecular properties, and the input *ρ*_{in} can be obtained from a conventional DFT calculation.

One may extend the NN-based B3LYP functional developed in Ref. 5 into a quasi-local descriptor-based ML-DFT model. In this case, the ML-DFT model outputs a set of space-dependent coefficients {*a*_{0}, *a*_{X}, *a*_{C}}, which calibrates the original B3LYP functional. We remark that an additional correction term Δ*E* of the energy functional can also be added to enhance the model’s capability to calculate the absolute energy. Besides the approach of outputting the space-dependent coefficients and the correction term, one may also directly target an energy density^{104} of the underlying energy functional as an intermediate output. Adding energy density within the ML-DFT model might be useful to obtain either the XC potential (by automatic differentiation) or the XC functional (by numerical integration^{105}).

Such a model has recently been realized.^{106} It is a quasi-local version of the electron density formulation of the NN-based ML-DFT model reported in Ref. 5. Instead of learning the mapping *ρ*_{quasi-local} → *v*_{XC} from scratch, the model learns space-dependent coefficients combining three existing functionals as follows:

$$\varepsilon_{\mathrm{XC}}^{\mathrm{ML}}(\mathbf{r}) = \mathbf{f}_{\theta}(\mathbf{r}) \cdot \left[ \varepsilon_{X}^{\mathrm{LDA}}(\mathbf{r}),\ \varepsilon^{\mathrm{HF}}(\mathbf{r}),\ \varepsilon^{\omega\mathrm{HF}}(\mathbf{r}) \right]^{\mathsf{T}},$$

where **f**_{θ} is a row vector of three elements output by the machine learning model, while $\varepsilon_{X}^{\mathrm{LDA}}(\mathbf{r})$, *ɛ*^{HF}(**r**), and *ɛ*^{ωHF}(**r**) are the local LDA,^{56} local Hartree–Fock, and local range-separated Hartree–Fock energy densities (see Ref. 107), respectively. An extra D3^{108} correction was added to the ML functional $E_{\mathrm{XC}}^{\mathrm{MLP}}$ to produce the final XC energy prediction.
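The pointwise mixing of energy densities can be sketched as follows. The grid, quadrature weights, energy-density values, and the convention that the mixed energy density integrates directly to the XC energy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_grid = 50
w = np.full(n_grid, 0.02)  # quadrature weights on a toy grid

# Stand-ins for the three local energy densities at each grid point.
eps_lda = -rng.random(n_grid)   # eps_X^LDA(r)
eps_hf = -rng.random(n_grid)    # eps^HF(r)
eps_whf = -rng.random(n_grid)   # eps^wHF(r)
eps_stack = np.stack([eps_lda, eps_hf, eps_whf], axis=0)  # (3, n_grid)

# f_theta(r): space-dependent 3-vector of mixing coefficients; here a fixed
# toy array stands in for the ML model's pointwise output.
f_theta = np.tile(np.array([[0.8], [0.2], [0.1]]), (1, n_grid))  # (3, n_grid)

# Pointwise mixed energy density, then quadrature over the grid
# (assumed convention: E_xc = \int eps_xc(r) dr).
eps_xc = np.sum(f_theta * eps_stack, axis=0)  # (n_grid,)
E_xc = np.sum(w * eps_xc)
```

Because the coefficients vary in space, the model interpolates smoothly between the ingredient functionals from point to point instead of using one global mixing ratio.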

Since each density descriptor *c*_{β} has a kernel of Gaussian orbital shape [as in Ref. 16], there can be multiple fragment energies contributing to the potential at any given position, allowing for a smooth transition between fragments.

Once the ML-DFT model is trained for a specific type of XC quantity, it can be incorporated into SCF calculations and subsequently used for post-processing the molecular properties of interest, as in traditional DFT calculations. The quasi-local density descriptor approach is emerging as the mainstream approach to constructing ML-DFT models. The remaining question is how best to design and represent the quasi-local electron density. Moreover, the electron density itself is increasingly used as the training target, as it is the key entity in DFT and contains a wealth of information. More research is expected in this direction.

### B. ML models for van der Waals interaction

An accurate description of the van der Waals (vdW) interaction is challenging for traditional DFT, as it is weak and is due to the interaction of transient atomic dipoles. While some conventional DFT approximations have shown remarkable performance in certain systems,^{109} they often rely on nonlocal quantities that make them difficult to apply in the quasi-local ML-DFT method.

Because the vdW interaction is caused by the interaction among transient atomic dipole moments, it can, in principle, be machine-learned from the electron density. As the vdW interaction is weak, it induces only a minute change in the electron density; the minor changes in density and the corresponding changes in the XC potential are both higher-order effects in a perturbative sense rather than the cause of the vdW interaction. It is thus possible to machine-learn the vdW interaction directly from the electron density while ignoring the higher-order density changes.

Consider an unperturbed electron density *ρ*_{0} with a small perturbation *δρ*,

$$v_{\mathrm{xc}}[\rho_{0} + \delta\rho](\mathbf{r}) = v_{\mathrm{xc}}[\rho_{0}](\mathbf{r}) + \int \frac{\delta v_{\mathrm{xc}}[\rho](\mathbf{r})}{\delta \rho(\mathbf{r}')}\bigg|_{\rho_{0}} \delta\rho(\mathbf{r}')\, d\mathbf{r}' + \mathcal{O}(\delta\rho^{2}). \tag{15}$$

It is evident from Eq. (15) that the minor density change resulting from the vdW interaction can be mostly ignored when calculating the XC potential during SCF, within reasonable accuracy requirements. The second term, which accounts for second-order variation in the electron density, is significantly smaller than the first term, as the change in density for vdW interaction is minimal. However, the energy shift due to vdW interaction is significant and cannot be neglected. Therefore, including an additional correction term for vdW interaction after SCF calculation is a reasonable approach. A separate vdW ML model can be trained using the quasi-local electron density and added to the current ML XC model as an extra correction term to the XC energy.
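As a sketch of such a post-SCF additive correction, consider a D3-like pairwise form; the damping function, parameters, and C6 combination rule below are illustrative placeholders, not the actual D3 parameterization.

```python
import numpy as np

def dispersion_correction(coords, c6, s6=1.0, r0=3.0):
    """Post-SCF pairwise dispersion correction (D3-like skeleton).

    E_disp = -s6 * sum_{i<j} f_damp(R_ij) * C6_ij / R_ij^6, with a simple
    Fermi-type damping that switches the correction off at short range.
    """
    e = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            c6_ij = np.sqrt(c6[i] * c6[j])  # toy combination rule
            f_damp = 1.0 / (1.0 + np.exp(-6.0 * (r / r0 - 1.0)))
            e -= s6 * f_damp * c6_ij / r ** 6
    return e

# Two "atoms" 4 units apart (units and C6 values illustrative).
coords = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 4.0]])
e_disp = dispersion_correction(coords, c6=np.array([10.0, 10.0]))
# The correction is simply added to the SCF total energy:
# E_total = E_SCF + e_disp
```

In an ML variant, the fixed C6 coefficients and damping would be replaced by quantities predicted from the quasi-local electron density around each atom.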

Empirical correction approaches, like the widely-used DFT-D3 method,^{108} are computationally efficient but limited in their effectiveness due to their reliance on a few empirical parameters and their sensitivity to specific systems. On the other hand, a customized ML model with a large number of tunable parameters and degrees of freedom may bring significant improvements.

Recently, Proppe *et al.* employed Gaussian process regression^{110} to correct systematic errors in DFT calculation with D3-type dispersion corrections.^{111} This model is referred to as D3-GP in the original work. The training data, consisting of 1248 samples of molecular dimers, are the differences between interaction energies obtained from PBE-D3(BJ)^{108,112,113}/ma-def2-QZVPP^{114,115} and DLPNO-CCSD(T)^{116,117}/CBS^{118} calculations. Once provided with reference data for new molecular systems, the underlying D3-GP model can learn to adapt to these and similar systems. The D3-GP model outperforms the existing PBE-specific correction schemes^{113,119,120} with respect to three different validation sets. One may expect that with sufficient training data, an ML model for vdW correction is likely to outperform existing empirical models for dispersion correction. Once the ML-vdW model is trained and validated, combining this ML model with the quasi-local ML-DFT model is straightforward.
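The idea behind a D3-GP-style correction can be sketched with a plain NumPy implementation of Gaussian process regression (posterior mean only); the one-dimensional descriptor, kernel length scale, and training data below are synthetic stand-ins, not those of Ref. 111.

```python
import numpy as np

def rbf_kernel(a, b, length=0.5, amp=1.0):
    """Squared-exponential kernel between 1D feature arrays."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return amp * np.exp(-0.5 * d2 / length ** 2)

# Toy 1D feature (e.g., a descriptor of a dimer geometry) and reference
# corrections: differences between a cheap method and a benchmark.
x_train = np.linspace(0.0, 4.0, 20)
y_train = 0.3 * np.sin(x_train) + 0.05 * x_train  # stand-in correction data

noise = 1e-4  # observation-noise variance (regularizes the solve)
K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
alpha = np.linalg.solve(K, y_train)

def gp_correction(x_new):
    """GP posterior mean: the learned correction at new geometries."""
    return rbf_kernel(np.atleast_1d(x_new), x_train) @ alpha

pred = gp_correction(2.0)
```

Given new reference data, one simply augments `x_train`/`y_train` and re-solves, which is the "learn to adapt to similar systems" behavior exploited by D3-GP.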

### C. Other future research directions

The full potential of ML-DFT models can be explored by utilizing larger and more diverse datasets. By incorporating diverse molecules, chemical environments, and properties, ML-DFT models can capture finer details of the exchange-correlation interaction and thus improve their generalizability. Expanding the dataset to include molecules of various sizes, complexities, and properties would enhance the training of ML-DFT models and enable more accurate representations of the XC quantities. While maintaining the efficiency of model training would become challenging, larger models with more parameters may effectively capture intricate features and correlations in the data, leading to improved accuracy and reliability of ML-DFT models.

Recently, the notion of the neural operator^{121} and the technique of operator learning^{122} have gained much attention in different scientific communities. The goal of operator learning is to seek directly a functional relation that maps elements of one infinite-dimensional space to another. One great feature of operator learning is that the parameterization of the mapping is discretization-invariant, i.e., the resulting mapping is independent of the resolution of the input and output data, as the operator learning model aims to learn the intrinsic structure of the map between the abstract spaces. One may expect that this approach could benefit the exploration of XC functionals, which map electron densities, smooth functions of spatial variables, to the energies of the underlying quantum systems. Moreover, by incorporating domain knowledge and physical constraints, ML-DFT models may achieve better representability for the exchange-correlation quantities, leading to the development of more accurate and physically meaningful XC functionals.

## IX. CONCLUDING REMARKS

The explosive development in AI has catalyzed a quick turnover of machine-learning models for density functional theory. From an algorithmic perspective, most of the above-mentioned approaches have focused on applying ML architectures such as artificial or convolutional neural networks to learn the XC functionals. However, other promising candidates, such as graph neural networks (GNNs), recurrent neural networks (RNNs),^{123} and transformers,^{12} are also being explored for overhauling the design of XC functionals. GNNs extend CNNs to irregular grids for the electron density or XC potential. RNNs are suited to time-dependent data and may find profound applications in time-dependent DFT. Transformers and other attention-based models, in turn, let the model decide where to pay attention in the electron density or XC potential. Given the subtlety and sensitivity of electron density data in DFT problems, attention-based models may be a good fit.

Here, we have reviewed the machine learning approaches for constructing XC-related quantities (such as the energy functional or potential) in DFT. The review began with two pioneering works based on global descriptors, progressed toward more intuitive and transferable quasi-local models, and concluded with an additional ML term for the vdW interaction. For the quasi-local descriptor models, we introduced the holographic electron density theorem as the theoretical foundation and presented a series of successful implementation schemes. All quasi-local ML-DFT models (such as the ML XC potential model) share the same fundamental design elements and have deep physical connections. We have presented success stories for these variants,^{14,16,21,24} and we encourage readers to read the respective original papers, as well as the open-source codes and examples provided. We hope that new generations of ML-DFT models will accurately construct the universal XC functional of DFT in the near future, revolutionizing the field of quantum chemistry, similar to how AlphaFold^{124} has transformed the field of structural biology.

Looking forward, the eventual ML-DFT model for the XC functional should have the following features. First, the descriptors should be built from the quasi-local electron density. Second, the targets should include the high-precision electron density; this can be an explicit target or an implicit one (for instance, in Ref. 5, the explicit target is the XC potential, which in turn leads to the high-precision electron density upon solving the KS equation). Finally, the XC potential and energy density can be the output or an intermediate target that leads to the target electron density. An additional machine-learning module for the vdW interaction may also be included in the workflow to handle the weak interaction of transient atomic dipoles. Ultimately, the ML-DFT model combined with the vdW interaction module should accurately reproduce the target energy and the target electron density for any molecular system.

## AUTHOR DECLARATIONS

### Conflict of Interest

The authors have no conflicts to disclose.

### Author Contributions

**Jiang Wu**: Conceptualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **Sai-Mang Pun**: Conceptualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **Xiao Zheng**: Conceptualization (equal); Writing – original draft (equal); Writing – review & editing (equal). **GuanHua Chen**: Conceptualization (lead); Writing – original draft (equal); Writing – review & editing (equal).

## DATA AVAILABILITY

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

### APPENDIX A: TUTORIAL OF THE ML-DFT XC POTENTIAL MODEL

For hands-on experience, readers are encouraged to try out our open-source package on GitHub.^{125} Most of the code is written in Python, and the ML models are built with the open-source package PyTorch.^{126} To better understand the implementation details, readers new to PyTorch are advised to work through a comprehensive PyTorch tutorial before modifying the models we provide; as a starting point, PyTorch offers introductory tutorials on its website.

As a simple example, we use the XC potential generated by the Wu–Yang (WY) method^{91} as the direct training set. For the H_{2} molecule (see Fig. 9 for the numerical results), training can be performed with pre-calculated XC potentials for H_{2} at various H–H bond distances, and no SCF calculation is needed. At the evaluation phase, a full SCF calculation is performed for each structure, implemented with the PySCF package.^{127} To perform training and evaluation on this example, one may walk through the following steps:

Before getting started, make sure all the prerequisites are installed and work properly.

Create and enter a new folder; download the code and dataset by typing

$ git clone https://github.com/zhouyyc6782/oep-wy-xcnn.git

Enter the example/simple_H2 directory. Create a folder called log by typing

$ mkdir log

to store the upcoming results and logged files.

Start training by typing

$ python ../nn_train/main.py train.cfg

Here, all training settings and hyper-parameters are defined in the .cfg file; to write a new .cfg file for a different configuration, please refer to the README file provided with the GitHub repo.

Training will start on the provided H_{2} dataset; by default, the number of epochs is 1000.

Then perform SCF calculations with the newly trained model; the corresponding command and configuration are described in the README file in the GitHub repo.

One can check the SCF performance of the model by examining the generated output file. A typical run for a small molecule like H_{2} should result in an error at the level of 10^{−5}–10^{−7} in terms of the *I* value. Since only one H_{2} structure is included in the simple_H2 training set, the error could be larger. Here, the *I* value between two (possibly different) densities is defined to be

$$I(\rho_{1}, \rho_{2}) = \int \left| \rho_{1}(\mathbf{r}) - \rho_{2}(\mathbf{r}) \right|^{2} d\mathbf{r}.$$
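The density-error metric is straightforward to evaluate numerically on a quadrature grid. The sketch below assumes the integrated squared density difference as the definition of the *I* value (the package's exact definition may differ; see the original paper and repo) and uses synthetic 1D densities.

```python
import numpy as np

def i_value(rho1, rho2, weights):
    """Integrated squared density difference on a quadrature grid:
    I = \\int |rho1(r) - rho2(r)|^2 dr  (assumed definition)."""
    return np.sum(weights * (rho1 - rho2) ** 2)

# Toy 1D grid and two nearly identical densities.
r = np.linspace(-6, 6, 241)
w = np.full_like(r, r[1] - r[0])  # uniform quadrature weights
rho_scf = np.exp(-r ** 2)
rho_ref = np.exp(-r ** 2) * (1 + 1e-3 * np.cos(r))
err = i_value(rho_scf, rho_ref, w)
```

A well-converged SCF density against a good model should drive this quantity toward zero, which is what the 10^{−5}–10^{−7} range above reflects.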

This tutorial is centered on a pre-built dataset from one H_{2} structure for both training and SCF. To reproduce the result from the original paper,^{14} a modified and re-compiled version of PySCF is needed for generating WY target data from scratch with the codes in the folder oep-wy (while the SCF part only needs the vanilla version of PySCF). One can refer to the README in the GitHub repo for more details on installing a custom version of the PySCF package.

The scripts in the repo (run_oep.py, gen_dataset.py, run_train.py, and run_xcnn.py) automate, respectively, generating data from WY calculations, collecting the data, training the model with the data, and testing the model with the SCF procedure. Interested readers are advised to follow the README from the GitHub repository cloned above for re-compiling PySCF and for additional custom implementations of the codes.

The codes provided within this tutorial cover (i) data generation (with the WY method), (ii) model training, and (iii) the SCF computation. One can build on this GitHub repo for molecules or ions other than the H_{2} of this simple example. Depending on the format of the dataset, one may need to write scripts analogous to run_oep.py, gen_dataset.py, run_train.py, and run_xcnn.py to automate the whole procedure.

### APPENDIX B: OPTIMIZED EFFECTIVE POTENTIAL AND DATA GENERATION

The electron densities that are employed to train the ML models can be obtained using highly accurate *ab initio* methods such as wave-function based methods like CCSD.^{72} Besides the electron density, the values of XC potential are also needed. Given a density computed from CCSD, the corresponding XC potential can be calculated by various optimization procedures that effectively invert the KS equations (collectively referred to as the inverse Kohn–Sham methods; see also Ref. 128). The optimization procedure employed in Ref. 14 to generate a training dataset is the so-called Wu–Yang method (WY) developed in Ref. 91, which will be briefly elaborated here.

Readers might wonder: if a numerical optimization procedure can resolve the XC potential from the electron density, why bother training an ML model that does the same thing? The answer lies in the core concept of DFT itself. What we want the ML model to predict is the universal XC functional that maps any density to its corresponding XC potential. The optimization procedure, in contrast, only solves for the system-specific XC potential associated with one particular known electron density. The procedure entails only the mathematics of inverting the KS equations and does not include the physics of the many-particle system at all, whereas the ML model tries to learn the intrinsic physics behind the mapping, which is by definition fundamental. The densities and XC potentials generated by inverse KS methods are fed to the ML model as training data.

Given a target density *ρ*_{in}, one first constructs a Lagrangian, denoted as *W*_{s}, in terms of the total effective potential (denoted as *v*) and the single-particle wave functions (denoted as *ϕ*_{i}’s),

$$W_{s}[v, \{\phi_{i}\}] = \langle \Phi | \hat{T} | \Phi \rangle + \int v(\mathbf{r}) \left[ \rho(\mathbf{r}) - \rho_{\mathrm{in}}(\mathbf{r}) \right] d\mathbf{r},$$

where Φ = (*N*!)^{−1/2} det(*ϕ*_{i}(*x*_{j})) is the Slater determinant associated with the orbitals *ϕ*_{i}’s, *ρ*(**r**) is the corresponding density, $\hat{T}$ is the kinetic energy operator, and *v*(**r**) serves as a Lagrange multiplier. When *W*_{s} is stationary with respect to *v*, the electron density becomes the same as the given density *ρ*_{in}, and the stationary value of *W*_{s} gives the noninteracting kinetic energy.^{129} Once the effective potential is calculated, the XC potential *v*_{xc} can be easily found by subtracting the external and the Hartree potentials.^{130}

With the pairs of density and XC potential in hand, the training procedure is decoupled from the KS SCF procedure: the ML model simply maps its input *ρ* to the output *v*_{xc}. Training proceeds with a typical backpropagation procedure, using an optimizer such as stochastic gradient descent (SGD)^{131} or the Adam method.^{132} Once sufficiently large datasets are available for various types of molecules and quasi-local environments, the parameters of the ML XC potential model can be better trained to yield a more accurate and universal XC potential for real molecular systems.
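Such a decoupled training loop might look as follows; the network size, synthetic descriptors, and linear stand-in targets are illustrative, not the setup of Ref. 14.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Pointwise map from a quasi-local density descriptor to v_xc at that point,
# trained on precomputed (descriptor, WY potential) pairs, outside any SCF loop.
model = nn.Sequential(nn.Linear(8, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic dataset: 256 descriptors and stand-in "WY" target potentials.
x = torch.randn(256, 8)
y = x.sum(dim=1, keepdim=True) * 0.1

losses = []
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # plain supervised regression
    loss.backward()              # standard backpropagation
    opt.step()                   # Adam parameter update
    losses.append(loss.item())
```

Because the targets are precomputed, each epoch is an ordinary regression step; no KS diagonalization appears inside the training loop.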
