Many proteins are regulated by dynamic allostery wherein regulator-induced changes in structure are comparable with thermal fluctuations. Consequently, understanding their mechanisms requires assessment of relationships between and within conformational ensembles of different states. Here we show how machine learning based approaches can be used to simplify this high-dimensional data mining task and also obtain mechanistic insight. In particular, we use these approaches to investigate two fundamental questions in dynamic allostery. First, how do regulators modify inter-site correlations in conformational fluctuations (*C*_{ij})? Second, how are regulator-induced shifts in conformational ensembles at two different sites in a protein related to each other? We address these questions in the context of the human protein tyrosine phosphatase 1E’s PDZ2 domain, which is a model protein for studying dynamic allostery. We use molecular dynamics to generate conformational ensembles of the PDZ2 domain in both the regulator-bound and regulator-free states. The employed protocol reproduces methyl deuterium order parameters from NMR. Results from unsupervised clustering of *C*_{ij} combined with flow analyses of weighted graphs of *C*_{ij} show that regulator binding significantly alters the global signaling network in the protein; however, not by altering the spatial arrangement of strongly interacting amino acid clusters but by modifying the connectivity between clusters. Additionally, we find that regulator-induced shifts in conformational ensembles, which we evaluate by repartitioning ensembles using supervised learning, are, in fact, correlated. This correlation Δ_{ij} is less extensive compared to *C*_{ij}, but in contrast to *C*_{ij}, Δ_{ij} depends inversely on the distance from the regulator binding site. Assuming that Δ_{ij} is an indicator of the transduction of the regulatory signal leads to the conclusion that the regulatory signal weakens with distance from the regulatory site. Overall, this work provides new approaches to analyze high-dimensional molecular simulation data and also presents applications that yield new insight into dynamic allostery.

## INTRODUCTION

Regulation of protein activity is essential for normal cell functionality. Many proteins are regulated allosterically with spatial gaps between stimulation and active sites and biological stimuli that regulate them include, for example, small molecules, post-translational modifications, and changes in intensive state-variables like temperature and pH. These stimuli are known to not only switch activities on-and-off but also fine-tune them.

Understanding allosteric regulation of proteins requires not only assessment of the various individual states, including inactive and active states, but also relationships between states. For many proteins, atomically detailed models of regulation can be constructed in terms of how their representative (minimum-energy) structures differ between states.^{1–9} But in such cases, the representative structures differ significantly between states in that thermal fluctuations are negligible compared to structural differences. Such regulatory models, however, cannot be constructed when differences between representative structures are small and comparable to thermal fluctuations.^{10–18} Understanding such dynamic allosteric regulation requires assessment of conformational ensembles of individual states, as well as relationships between conformational ensembles of different states. This challenge faces several major protein families, including G protein coupled receptors, nuclear transcription factors, heat shock proteins, immune cell receptors, and viral attachment proteins.

Nevertheless, since energy drives structure and dynamics, molecular simulation techniques, which assess them as functions of energies, are capable of providing direct relationships between energy, structure, and dynamics. Many molecular simulation methods are available to obtain response functions of external stimuli and to generate conformational ensembles for protein states. However, constructing a molecular basis for any perturbative response, which is also key to making testable forward predictions, also requires analysis of conformational ensembles.

Toward this end, it has been shown that in a given state, correlations in equilibrium thermal fluctuations are related directly to the propensity for signal transduction.^{19} Given time-dependent conformations of two protein sites *i* and *j* in a specific state, *f*_{i}(*t*) and *f*_{j}(*t*), their fluctuation correlations are determined as

where the bar denotes average and the *σ* denote fluctuations in individual sites. Many approaches, such as undirected graphs, have been employed for analyzing these *C*_{ij} in the context of the spatial organization of the protein to obtain insight into how different spatial regions communicate with each other.^{19–30} These studies have provided valuable insight into molecular details underlying dynamic allostery. Nevertheless, *C*_{ij} are computed from the active state of the protein, and it remains generally unexplored as to how regulators modify *C*_{ij} and the resulting inter-site signal communication.

Understanding dynamic allostery also requires a statistical assessment of the differences between conformational ensembles of states. Toward this end, we and others have developed methods and metrics for quantifying differences between ensembles.^{31–37} These methods have been used to tease out protein regions affected by regulators, rank individual sites by the extent of their regulator-induced changes, and also statistically analyze the effect of different regulators on conformational ensembles, providing new mechanistic insight into dynamic allostery.^{31–37}

Nevertheless, the question that still remains unexplored is whether there exists a relationship between regulator-induced ensemble shifts at two different sites in a protein. From a phenomenological standpoint, proteins regulated by dynamic allostery are expected to exhibit a subset of conformational densities that are common to different states, and regulators control protein activity by simply introducing a bias on those densities.^{10–18} A proposition that follows directly from this governing principle is the possibility that perhaps regulator-induced density shifts at two sites in a protein may be correlated. In other words, if regulators bias the conformational densities of two sites *i* and *j*, that is, *f*_{i} → *g*_{i} and *f*_{j} → *g*_{j}, then the question is whether the fluctuations in shifts are also correlated. This requires us to determine whether ensemble shifts in this site, $gi*\u2212fi\xaf$, where $gi*=gi\(fi\u2229gi)$, are related to ensemble shifts, $gj*\u2212fj\xaf$, of another site *j*. Analogous to (1), this relationship can be quantified by computing

where *σ** denotes fluctuations in ensemble shifts.

In this work, we address this question as well the aforementioned question on how regulators affect *C*_{ij}. We address them by first developing a supervised learning method to repartition high-dimensional conformational ensembles, which then enables the computation of Δ_{ij}. We then use a combination of unsupervised machine learning and graph analysis techniques to analyze correlation data. We address these issues using, as a model system, the PDZ2 domain of human protein tyrosine phosphatase 1E (hPTP1E), which is one of the most well studied proteins in the context of dynamic allostery.^{38–44} In general, PDZ domains are characterized as protein-protein interaction modules and are found in abundance in various species. PDZ domains have been shown to use multiple surfaces for docking a diverse set of molecules, and these binding events can be allosterically regulated.^{44} As such, PDZ-containing proteins interact with multiple proteins within cells, and so understanding the regulation of PDZ enabled protein-protein interactions, including allostery, is vital to understanding their biology. In the case of the PDZ2 domain of hPTP1E, X-ray crystallography shows that the binding of the RA-GEF2 (Ras-association Guanine nucleotide Exchange Factor 2) terminal peptide induces a backbonal change smaller than 1 Å,^{45} a magnitude comparable to thermal fluctuations. This result is also supported by NMR chemical shift measurements.^{39} Figure 1 compares the conformational ensembles of PDZ2 in its GEF2-bound and GEF2-free states obtained using all atom molecular dynamics (MD) simulations (see the section titled Methods). Despite the several biophysical and mutational studies, including those mentioned above, the mechanism of PDZ2 stimulation by its natural regulator still remains largely unknown. The study we undertake here sheds new mechanistic light into PDZ2 regulation.

## METHODS

### Molecular dynamics

We use molecular dynamics (MD) for generating conformational ensembles of PDZ2 in its GEF2-free and GEF2-bound states. The starting coordinates are taken from crystallographic structures deposited in the Protein Databank (PDB IDs: 3LNX and 3LNY).^{45} Hydrogens are added and their positions are optimized using PDB2PQR.^{46} The N- and C-termini of the GEF2 peptide and the protein are capped by adding acetyl (ACE) and *N*-methylamide (NME), respectively. The GEF2-free and the GEF2-bound structures are placed in cubic boxes containing ∼11 K water molecules, including those resolved crystallographically. KCl concentration is set at 75 mM, and there are extra K^{+} ions compared to Cl^{−} ions to compensate for the charge on the GEF2 peptide. MD simulations are carried out in duplicates (different random seeds for velocities) for both the GEF2-free and GEF2-bound states of PDZ2, and each MD simulation is 0.5 *μ*s long. Backbone root mean square deviations (RMSDs) and potential energies are tracked separately for all simulations, and since RMSDs take 0.25 *μ*s to level off for one trajectory, we discard the first 0.25 *μ*s of each trajectory and use only the final 0.25 *μ*s of each trajectory for analysis. Additional details on assessing convergence and construction of conformational ensembles are discussed in the section titled Results.

All four MD simulations are carried out under isothermal-isobaric boundary conditions, and using Gromacs version 4.5.3.^{47} Temperature is maintained at 298 K using an extended ensemble approach^{48,49} and with a coupling constant of 1 ps. An extended ensemble approach is also used for maintaining pressure.^{50} Pressure is maintained at 1 bar using a coupling constant of 1 ps and a compressibility of 4.5 × 10^{−5} bar^{−1}. Electrostatic interactions are computed using the particle mesh Ewald scheme^{51} with a Fourier grid spacing of 0.1 nm, a fourth-order interpolation, and a direct space cutoff of 10 Å. van der Waals interactions are computed explicitly for interatomic distances ≤10 Å. The bonds in proteins and the geometries of water molecules are constrained,^{52,53} and consequently an integration time step of 2 fs is employed. The protein and ions are described using Amber99sb-ILDN parameters,^{54} and water molecules are described using SPC/E (Single Point Charge Extended) parameters.^{55}

### Inter-site fluctuation correlations

If *f*_{i} and *f*_{j} represent the distributions of two sites in a protein, then their fluctuation correlations *C*_{ij} are determined by (1). When the distribution is discrete and the data are vectors, that is, *f*_{i} = {**f**_{i1}, **f**_{i2}, …, **f**_{im}}, such as that obtained from molecular simulations, then (1) takes the following form:

where |…| denotes the absolute value. When applied to proteins in the context of constructing signaling networks, the vectors **f**_{ik} and **f**_{jk} are generally those of *C*_{α} atoms or centers-of-mass (CoMs) of backbone atoms of two amino acids, *i* and *j*.^{19,20,22,24,26,27} Here, we take **f**_{ik} and **f**_{jk} as the centers of mass of two amino acids *i* and *j* so that they also represent, although coarsely, the structural fluctuations in side chains. Including side chain fluctuations is important because GEF2 has been shown to affect the dynamics of PDZ2 side chains more than their respective backbones.^{39,45}

### Inter-site correlations in ensemble shifts

If *f*_{i} and *g*_{i} represent two different distributions of the same site *i*, but under the influence of different external potentials, and if *f*_{j} and *g*_{j} represent the corresponding distribution of site *j*, then inter-site correlations in ensemble shifts Δ_{ij} are given by (2). Calculation of Δ_{ij} requires that the ensemble data, *f* and *g*, are repartitioned such that conformations corresponding to the overlap region *f* ∩ *g* are identified separately for each site and then removed from *f* and *g* to get the residuals *f** and *g**, respectively. Below we show that such a high-dimensional repartitioning task can be accomplished using the mathematical framework of support vector machines (SVMs). The development below follows from our SVM-based method to compute quantitative estimates for overlaps between conformational ensembles,^{34} which we also describe briefly for the sake of completeness.

SVMs are traditionally used for predicting binary classification of data.^{56–59} A SVM is first trained on a set of instances (*x*_{1}, *x*_{2}, …) with known group identities (*y*_{1}, *y*_{2}, … = {−1, +1}) and then the trained SVM is used for predicting the group identity of an unclassified instance. A SVM can also be constructed when the instances are 3*N*-dimensional molecular conformations (**r**) and belong to two ensembles, *f* = {**r**_{1}, **r**_{2}, …, **r**_{m}} and *g* = {**r**_{m+1}, **r**_{m+2}, …, **r**_{2m}}.^{60} Here we use the mathematical framework of SVMs, not for predicting classification but for repartitioning the pre-classified ensembles *f* and *g* to obtain the subsets *f** and *g**.

The training of the SVM is set up as a Lagrange optimization problem, where the goal is to maximize the linear separation between the two groups in some Hilbert space. Essentially, two hyperplanes,

with *y*_{i} = ±1 are sought that are constructed from a subset of the instances such that the distance 2/∥**w**∥ between the hyperplanes is maximized. This distance is maximized by minimizing

with respect to ∥**w**∥ and *b* and maximizing it with respect to the Lagrange multipliers *α*_{k}. Note that the square on ∥**w**∥ permits quadratic optimization and the 1/2 coefficient is introduced for mathematical convenience. Substituting the conditions

and

into Eq. (5) rearranges the auxiliary function to

where *K*(**r**_{k}, **r**_{l}) = **r**_{k}·**r**_{l}, which is then maximized under the constraint *∑α*_{k}*y*_{k} = 0, *∀ k*. An additional box constraint is introduced, 0 ≤ *α*_{k} ≤ *C*, in which *C* serves an upper limit on the magnitudes of the Lagrange multipliers.

Note that in the optimization of (8), the feature used for classifying **r** is its linear projection on other **r**. The vectors **r**, however, are generally not linearly separable in the Euclidean space when they represent molecular conformations. Such issues are dealt with by choosing alternative kernels that are, by themselves, inner products in the transformed Hilbert space,^{57–59} that is, $K(rk,rl)=\varphi (rk),\varphi (rl)$. The primary advantage of such “kernel-tricks” is that they bypass the need to determine the explicit form of the function *ϕ*(**r**) that transforms the data from the original space to the desired Hilbert space to make the data linearly separable.

The optimization of (8) produces Lagrange multipliers that are ultimately plugged into the binary classifier, $F(r)=\u2211i=12m\alpha kykK(rk,r)$, to predict the group identities of previously unseen data **r**. More importantly, we note that the optimization of (8) produces two different sets of Lagrange multipliers, $\alpha k=0$ and $\alpha k>0$. The **r**_{k} whose corresponding *α*_{k} > 0 essentially define the maximum margin hyperplanes that are sought in (4), and these **r**_{k} are referred to as support vectors.

We have demonstrated^{34,35,37} that a Kernel can be chosen such that total number of generated support vectors *s*, where 2 ≤ *s* < 2*m*, can be used as a quantitative estimate of the normalized overlap between the two distributions, that is,

The difference between the two ensembles can then be quantified in terms of a normalized metric *η* ∈ [0, 1),

which takes up a value closer to unity as the difference between ensembles increases. Specifically, we showed that if we choose a Gaussian radial distribution function as the Kernel, that is,

then the parameter *γ* and the box constraint *C*, which together define the Hilbert space, can be optimized to yield overlaps with high accuracy. Choosing *γ* = 0.4 Å^{−2} and *C* = 100 results in an error in *η* that is less than 6% for multi-modal distributions.^{37} This error can be reduced further by choosing higher upper limits for Lagrange multipliers; however, the subtle gain in accuracy requires a significant compromise in the algorithm’s efficiency.^{37}

The finding that the ratio *s*/2*m* accurately quantifies the overlap between ensembles suggests that the support vectors **r**_{k} may actually be part of the overlap region (*f* ∩ *g*) in the Euclidean space. If this were true, then they can be simply removed from *f* and *g*, respectively, to obtain *f** and *g**. Figure 2(a) shows the distribution of support vectors in a test case of two partially overlapping 2D Gaussian distributions. Indeed, we find that the majority of the support vectors are part of the overlap region. However, a fraction of the support vectors does not belong to the overlap region but instead belong to *f** and *g**. This would imply that the ratio *s*/2*m* overestimates the overlap, and consequently the computed *η* is smaller than the analytical value. This is, in fact, what we noted previously^{34,35,37}—for almost all of our test cases involving various distribution types (unimodal, bimodal, trimodal, and quadrimodal), we found that the computed *η* are systematically underestimated (<6%) with respect to exact values.

Now if *f** and *g** were constructed by simply removing the support vectors from *f* and *g*, then *f** and *g** would, at worst, suffer from partial omissions of instances. More importantly, *f** and *g** will not be contaminated by instances belonging to *f* ∩ *g*. Figure 2(b) shows the average omission error in 50 random pairs of Gaussian distributions, one of which is shown in Fig. 2(a). We find that as the ensemble size (*m*) increases, the omission error reduces and for *m* ≥ 10 000, the average omission error is below 4%, and the worst case error is also below 7%, which are similar to errors we reported earlier^{37} in the estimation of *η*.

The support vectors can, therefore, be used to construct *f** and *g**, and without need for fitting the underlying distributions to assumed mathematical forms. *f** and *g** can be used to determine inter-site correlations in ensemble shifts. For discrete distributions and when the data are vectors, (2) takes the following form:

Note that the averaging is not performed over all conformations in $gi*$ and $gj*$. Instead it runs only over a subset of protein conformations that are common to both $gi*$ and $gj*$. Consequently, we introduce *p*_{ij}, which denotes the joint probability of finding protein conformations that are part of both $gi*$ and $gj*$. Note also that **f**_{i} and **f**_{j} represent 3*n*_{i} and 3*n*_{j} dimension vectors, where *n*_{i} and *n*_{j} are the numbers of atoms in the two amino acids *i* and *j*. Consequently, the support vectors that are generated are representative of entire conformations of amino acids. After repartitioning conformational ensembles of amino acids, we then represent **f**_{i} and **f**_{j} by their respective CoMs and compute Δ_{ij}.

### Undirected weighted graphs

Inter-site correlations *C*_{ij} are combined with each other and also connected with the spatial organization of sites using undirected weighted graphs *G*(*V*, *E*), where *V* represents the number of nodes and *E* represents the edges that connect the nodes.^{19,20,24,26,27} Nodes on graphs represent points on the proteins that serve as receivers and/or transmitters of information in signaling pathways. Since signal transduction generally needs to be understood at the level of amino acids, nodes on graphs typically represent amino acids,^{19,20,24,26,27} and we define them as CoMs of amino acids. From a physical standpoint, direct signal communication is expected to occur between only those nodes whose conformational spaces are directly influenced by each other.^{19,24,26,27} To implement this, one typically measures the distance between the CoMs of two nodes, and if that distance is less than a predefined cutoff, which is generally in the range 4–6 Å,^{19,24,26,27} then the two nodes are connected by a edge. Otherwise, the two nodes remain unconnected. Instead of using cutoffs, we implement a parameter-free approach that uses the same physical logic, but determines connectivity on the basis of overlap between node volumes—if Γ_{i} is the volume of the conformational ensemble of node *i*, then it will be connected to node *j* only if Γ_{i} ∩ Γ_{j} > 0. Edge weights represent a quantity that tells us how nodes communicate with each other. We define edge weights as the inverse of the inter-site correlations. Therefore, the preferred allosteric path that connects any two nodes will be the shortest weighted path. We use Dijkstra’s algorithm^{61} implemented in Igraph^{62} to solve for shortest paths between all *V*(*V* − 1)/2 node pairs.

## RESULTS AND DISCUSSION

### Construction of ensembles

We generate duplicate MD trajectories of the PDZ2 domain in its GEF2-free and GEF2-bound states. Each trajectory is 0.5 *μ*s long, and we use the second halves of these trajectories for analysis and for constructing conformational ensembles.

To determine whether the second halves of these trajectories provide adequate representations of conformational ensembles, we compute residue-wise differences between conformational ensembles constructed from duplicate trajectories. For each residue *i* in PDZ2, we determine $\eta i1\u21942$, which is a quantitative estimate of the difference between their ensembles $fi1$ and $fi2$ taken from the duplicate trajectories. Note that *η* is bounded, that is, *η* ∈ [0, 1), and takes up a value closer to unity as the difference between ensembles increases. Each of the two duplicate ensembles *f*^{1} and *f*^{2} contain 5000 snapshots extracted at regular time intervals from their respective trajectories. Note also that prior to extracting the coordinates of a residue, the entire structure of PDZ2 is least-square fitted on to the starting structure, which removes the bias against whole molecule rotation and translation.^{34–37}

Figure 3 shows the cumulative distribution of residue-wise *η* computed separately for both the GEF2-free and GEF2-bound state ensembles. We find that 90% of the residues have *η* values smaller than 0.35, which is equivalent to a mean position difference of less than $erf(0.35/2)=0.27$ Å,^{35} showing that the differences between the duplicate trajectories are small. We also note that the *η* of a few residues, especially in the apo state, are large, but an inspection of residue identities reveals that they belong to the N- and C-termini of the PDZ2 domain. We exclude these terminal residues from further analysis.

Instead of discarding the data from the duplicate trajectories, we combine the ensembles from the duplicate trajectories and create one representative 10 000-conformation ensemble for each of the GEF2-free (*f*) and GEF2-bound states (*g*). We then estimate the difference between these ensembles $\eta if\u2194g$ and compare them to the $\eta i1\u21942$ estimated between duplicate trajectories. We find, in general, that $\eta if\u2194g>>\eta i1\u21942$, which shows that the statistical differences between duplicate trajectories are smaller than the differences between the GEF2-free and GEF2-bound ensembles. Together, this analysis shows that the latter half of the trajectories provide adequate representations of conformational ensembles of the two states.

### Comparison with experiment

The MD protocol reproduces well the methyl deuterium order parameters ($Saxis2$) obtained from NMR.^{39,63} Figure 4 compares the $Saxis2$ values computed from one of the two sets of trajectories to experimental estimates. The results are similar for the duplicate set of trajectories. We compute $Saxis2$ from MD trajectories by modeling the autocorrelation function^{64}

as an exponential decay

an assumption also used in estimating order parameters from NMR spectral densities. In the expressions above, $\mu ^$ are the unit vectors of C–C bonds in which the latter carbon is part of the methyl group, and *τ* is the relaxation time. We assume here that the order parameters of the C–C(H_{3}) bonds represent those of the hydrogens in the CH_{3} groups.

Figure 5 compares the GEF2-induced shifts in residue centers-of-mass (CoMs) and root mean square fluctuations (RMSFs). In general, we note that GEF2 affects the structure and dynamics of residue side chains more than their respective backbones, a result consistent with previous studies.^{39,45} Such a form of induced changes has contributed to the prevailing hypothesis^{39} that allosteric regulation in PDZ2 occurs primarily through changes in side chain structure and dynamics. However, since there is no formal relationship between the extent of induced shifts and signaling propensity, contributions from residues undergoing relatively small changes in backbone structure and dynamics should not be ignored.

### Equilibrium fluctuation correlations (*C*_{ij})

Using (3), we compute separately inter-site fluctuation correlations in the GEF2-free state ($Cijfree$) and the GEF2-bound state ($Cijbound$). These correlations are shown in Fig. 6. A visual inspection of *C*_{ij} suggests that GEF2 binding does not alter the overall correlation pattern. To analyze this quantitatively, we cluster the *C*_{ij} matrices using affinity propagation.^{66} Unlike most other clustering algorithms, affinity propagation is an unsupervised machine learning algorithm that does not assume *a priori* the number of clusters or a cutoff value for delineating clusters. Instead, the algorithm takes only as input the measures of similarity between pairs of data points, which we define as the inverse of the correlation, that is, 1/*C*_{ij}. We test for convergence by progressively doubling the optimization steps and checking for constancy. Affinity propagation yields almost identical clusters for the GEF2-free and GEF2-bound states with minor differences limited to the residues in the N- and C-termini of the PDZ2 domain, confirming that GEF2 binding indeed does not alter the overall correlation pattern.

Inter-site signal communication, however, does not solely depend on clustering patterns in *C*_{ij}. It also depends on connectivity between and within clusters.^{19–30} To further examine the effect of GEF2 binding the inter-site signaling we use *C*_{ij} to construct 2D undirected graphs and carry out a flow analysis of the graph.^{19,24,26,27} The nodes on the graph represent residues, and edge weights in the graph are taken 1/*C*_{ij}, implying that the optimal connectivity between two nodes will be given by the shortest weighted path connecting the nodes. We connect only those nodes by edges that physically interact with each other. Typically, two residues are connected directly if the distances between the average CoMs of their corresponding residues are less than a predefined cutoff, which is generally around 5 Å.^{19,24,26,27} Instead of using pre-defined cutoffs, we compute the overlap between the volumes of residue conformational ensembles, and two residues (nodes) are considered to physically interact (connected) if their volume overlap is non-zero.

We construct two such graphs, *G*^{free} using residue ensembles and correlations in the GEF2-free state and *G*^{bound} using correlations and residue ensembles in the GEF2-bound state. Note that while the two graphs have the same number of nodes (residues) *V* = 89, they have different numbers of connected edges (interacting residue pairs)—*G*^{free} has 810 edges and *G*^{bound} has 714 edges. On average, each node in *G*^{free} has 9.1 edges and each node in *G*^{bound} has 8.0 edges. We then solve for shortest weighted paths between all *V*(*V* − 1)/2 node pairs. After solving for shortest paths, we count how many times each node appears in the *V*(*V* − 1)/2 shortest paths, Ω_{i}. Note that (*V* − 1) ≤ Ω_{i} ≤ *V*(*V* − 1)/2. We do this separately for each of the two graphs, and so for each graph, we obtain a separate set of node-occurrences {Ω_{i}}. As before,^{19,24,26,27} we assume that a residue that has a higher Ω contributes more to allosteric signaling, and so we define Ω to be a residue’s signaling propensity.

For each graph, we rank separately its residues in decreasing order of their signaling propensity Ω. This yields, for each of the two graphs, an ordered set of signaling propensities. We find that the Pearson correlation between residue ranks in the two graphs is small (*ρ* = 0.18). The correlation is even smaller if only the top ranked (25%) residues are considered (Fig. 7). Now even if the ordering in the top ranked residues is ignored, we find that the pairwise identity overlaps between the two states are about 50%. This implies that if a residue contributes significantly to signaling in the GEF2-free state, it does not necessarily imply that it will also contribute to signaling in the GEF2-bound state. The residue list in Fig. 7 is also color-coded to highlight NMR order parameter changes (|Δ*S*^{2}|) effected by GEF2 binding. We note no apparent relationship between the signaling propensity of a residue and the |Δ*S*^{2}| it undergoes. As such, there is no formal relationship between the extent of induced shifts and signaling propensity, yet several comparisons between the two exist in the literature.

Taken together with results from unsupervised clustering, we find that GEF2 binding significantly alters the global signaling network in the protein, however, not by altering the spatial arrangement of strongly interacting residue clusters, but by modifying the connectivity between and within clusters.

### Correlations in ensemble shifts

To enable computation of Δ_{ij} using (12), we need to repartition the GEF2-free (*f*) and GEF2-bound (*g*) conformational ensembles of each amino acid such that conformations corresponding to their overlap region *f* ∩ *g* are identified and then removed from *f* and *g* to get the residuals *f** and *g**, respectively. We accomplish this task using supervised machine learning as detailed in the section titled Methods. Figure 8 shows the repartitioning results for two representative amino acids, S29 and R79, that undergo, respectively, the highest GEF2-induced changes in CoM and RMSF (see Fig. 5). A visual inspection indicates that most of the support vectors are part of the overlap region. However, there are some support vectors that are not part of the physical overlap region and are fencing the individual *f* and *g* ensembles. These conformations will end up being erroneously removed from *f* and *g* during the construction of *f** and *g**, but our analysis in Fig. 2 indicates that this omission error will be small.

We compute Δ_{ij} in two ways: (a) $\Delta ijfree\u2192bound$, which is the correlation in shift in the GEF2-bound state with respect to the GEF2-free state, and (b) $\Delta ijbound\u2192free$, which is the correlation in shift in the GEF2-free state with respect to the GEF2-bound state. $\Delta ijfree\u2192bound$ and $\Delta ijbound\u2192free$ are expected to be identical only if the conformational densities in the GEF2-free and GEF2-bound states for both residues *i* and *j* are symmetric about their interface.

Figure 9 shows these correlations. We note first that Δ_{ij} are generally weaker than *C*_{ij}, but at the same time, there do exist distinct residue clusters whose Δ_{ij} are large. The clustering patterns in $\Delta ijfree\u2192bound$ and $\Delta ijbound\u2192free$, as determined from affinity propagation,^{66} have some similarities, but are not identical, highlighting that the conformational densities in the GEF2-free and GEF2-bound states are not always symmetric about their interfaces.

To examine the clustering patterns in Δ_{ij}, we plot the highest Δ_{ij} of each residue as a function of distance from the GEF2 peptide (10). Note that both correlation matrices $\Delta ijfree\u2192bound$ and $\Delta ijbound\u2192free$ have *V* = 89 pairs of correlations for each residue, and we plot only the highest correlation value after combining the correlations from both matrices. We note that the strength of Δ_{ij} decreases systematically with distance from the GEF2 binding site. This is in contrast to the non-dependence of the strength of *C*_{ij} on distance (also shown in Fig. 10).

Now if we assume that Δ_{ij} serves as an indicator of the propagation of the regulatory signal, then it leads to the conclusion that the regulatory signal weakens with distance from the regulatory site. As such, and to the best of our knowledge, there are no propositions for indicators of regulatory signal transduction. Note that *C*_{ij} are indicators of how two sites in a protein communicate with each other in a specific state and are not reflective of how induced-changes in one site propagate to another site.

## CONCLUSIONS

Here we use the human PTP1E’s PDZ2 domain as a model system to address two outstanding questions in dynamic allostery. First, how do regulators modify inter-site correlations in conformational fluctuations (*C*_{ij})? Second, how are regulator-induced shifts in conformational ensembles at two different sites in a protein related to each other? We address these questions by analyzing correlations and causalities in high-dimensional conformational ensembles of the PDZ2 domain by using a combination of supervised and unsupervised machine learning approaches. In particular, we use affinity propagation to cluster *C*_{ij} matrices and also carry a flow analysis of weighted graphs of *C*_{ij}, which reveal that regulator binding significantly alters the global signaling network in the protein. More surprisingly, we find that regulator binding alters the global signaling network not by altering the spatial arrangement of strongly interacting residue clusters but by modifying the connectivity between and within clusters. We also develop a new supervised learning method to repartition high-dimensional conformational ensembles, which enables the computation of correlations in regulator-induced ensemble shifts Δ_{ij}. We find that Δ_{ij} are generally weaker than *C*_{ij}, but at the same time, there do exist distinct residue clusters whose Δ_{ij} are large. More importantly, we find that in contrast to *C*_{ij}, Δ_{ij} depends inversely on the distance from the regulator binding site. If we assume that Δ_{ij} is an indicator of the transduction of the regulatory signal, then it leads us to conclude that the regulatory signal weakens with distance from the regulatory site. We note that this assumption lacks a rigorous theoretical basis, but we also note that there exists no theory, nor any proposition in the literature for how regulatory signals are transmitted across proteins. Nevertheless, this finding motivates a further study of correlation patterns in other proteins regulated by dynamic allostery. Overall, this work provides new approaches to analyze high-dimensional molecular simulation data and also presents applications that yield new insight into dynamic allostery.

## ACKNOWLEDGMENTS

We acknowledge the use of the services provided by Research Computing at USF. This research was funded by USF Foundation and NSF Grant No. CHE-1531590.