Many proteins are regulated by dynamic allostery wherein regulator-induced changes in structure are comparable with thermal fluctuations. Consequently, understanding their mechanisms requires assessment of relationships between and within conformational ensembles of different states. Here we show how machine learning based approaches can be used to simplify this high-dimensional data mining task and also obtain mechanistic insight. In particular, we use these approaches to investigate two fundamental questions in dynamic allostery. First, how do regulators modify inter-site correlations in conformational fluctuations (Cij)? Second, how are regulator-induced shifts in conformational ensembles at two different sites in a protein related to each other? We address these questions in the context of the human protein tyrosine phosphatase 1E’s PDZ2 domain, which is a model protein for studying dynamic allostery. We use molecular dynamics to generate conformational ensembles of the PDZ2 domain in both the regulator-bound and regulator-free states. The employed protocol reproduces methyl deuterium order parameters from NMR. Results from unsupervised clustering of Cij combined with flow analyses of weighted graphs of Cij show that regulator binding significantly alters the global signaling network in the protein; however, not by altering the spatial arrangement of strongly interacting amino acid clusters but by modifying the connectivity between clusters. Additionally, we find that regulator-induced shifts in conformational ensembles, which we evaluate by repartitioning ensembles using supervised learning, are, in fact, correlated. This correlation Δij is less extensive compared to Cij, but in contrast to Cij, Δij depends inversely on the distance from the regulator binding site. Assuming that Δij is an indicator of the transduction of the regulatory signal leads to the conclusion that the regulatory signal weakens with distance from the regulatory site. Overall, this work provides new approaches to analyze high-dimensional molecular simulation data and also presents applications that yield new insight into dynamic allostery.

Regulation of protein activity is essential for normal cell functionality. Many proteins are regulated allosterically with spatial gaps between stimulation and active sites and biological stimuli that regulate them include, for example, small molecules, post-translational modifications, and changes in intensive state-variables like temperature and pH. These stimuli are known to not only switch activities on-and-off but also fine-tune them.

Understanding allosteric regulation of proteins requires not only assessment of the various individual states, including inactive and active states, but also relationships between states. For many proteins, atomically detailed models of regulation can be constructed in terms of how their representative (minimum-energy) structures differ between states.1–9 But in such cases, the representative structures differ significantly between states in that thermal fluctuations are negligible compared to structural differences. Such regulatory models, however, cannot be constructed when differences between representative structures are small and comparable to thermal fluctuations.10–18 Understanding such dynamic allosteric regulation requires assessment of conformational ensembles of individual states, as well as relationships between conformational ensembles of different states. This challenge faces several major protein families, including G protein coupled receptors, nuclear transcription factors, heat shock proteins, immune cell receptors, and viral attachment proteins.

Nevertheless, since energy drives structure and dynamics, molecular simulation techniques, which assess them as functions of energies, are capable of providing direct relationships between energy, structure, and dynamics. Many molecular simulation methods are available to obtain response functions of external stimuli and to generate conformational ensembles for protein states. However, constructing a molecular basis for any perturbative response, which is also key to making testable forward predictions, also requires analysis of conformational ensembles.

Toward this end, it has been shown that in a given state, correlations in equilibrium thermal fluctuations are related directly to the propensity for signal transduction.19 Given time-dependent conformations of two protein sites i and j in a specific state, fi(t) and fj(t), their fluctuation correlations are determined as

Cij=1σiσj01(fifi¯)(fjfj¯)dt,
(1)

where the bar denotes average and the σ denote fluctuations in individual sites. Many approaches, such as undirected graphs, have been employed for analyzing these Cij in the context of the spatial organization of the protein to obtain insight into how different spatial regions communicate with each other.19–30 These studies have provided valuable insight into molecular details underlying dynamic allostery. Nevertheless, Cij are computed from the active state of the protein, and it remains generally unexplored as to how regulators modify Cij and the resulting inter-site signal communication.

Understanding dynamic allostery also requires a statistical assessment of the differences between conformational ensembles of states. Toward this end, we and others have developed methods and metrics for quantifying differences between ensembles.31–37 These methods have been used to tease out protein regions affected by regulators, rank individual sites by the extent of their regulator-induced changes, and also statistically analyze the effect of different regulators on conformational ensembles, providing new mechanistic insight into dynamic allostery.31–37 

Nevertheless, the question that still remains unexplored is whether there exists a relationship between regulator-induced ensemble shifts at two different sites in a protein. From a phenomenological standpoint, proteins regulated by dynamic allostery are expected to exhibit a subset of conformational densities that are common to different states, and regulators control protein activity by simply introducing a bias on those densities.10–18 A proposition that follows directly from this governing principle is the possibility that perhaps regulator-induced density shifts at two sites in a protein may be correlated. In other words, if regulators bias the conformational densities of two sites i and j, that is, figi and fjgj, then the question is whether the fluctuations in shifts are also correlated. This requires us to determine whether ensemble shifts in this site, gi*fi¯, where gi*=gi\(figi), are related to ensemble shifts, gj*fj¯, of another site j. Analogous to (1), this relationship can be quantified by computing

Δij=1σi*σj*01(gi*fi¯)(gj*fj¯)dt,
(2)

where σ* denotes fluctuations in ensemble shifts.

In this work, we address this question as well the aforementioned question on how regulators affect Cij. We address them by first developing a supervised learning method to repartition high-dimensional conformational ensembles, which then enables the computation of Δij. We then use a combination of unsupervised machine learning and graph analysis techniques to analyze correlation data. We address these issues using, as a model system, the PDZ2 domain of human protein tyrosine phosphatase 1E (hPTP1E), which is one of the most well studied proteins in the context of dynamic allostery.38–44 In general, PDZ domains are characterized as protein-protein interaction modules and are found in abundance in various species. PDZ domains have been shown to use multiple surfaces for docking a diverse set of molecules, and these binding events can be allosterically regulated.44 As such, PDZ-containing proteins interact with multiple proteins within cells, and so understanding the regulation of PDZ enabled protein-protein interactions, including allostery, is vital to understanding their biology. In the case of the PDZ2 domain of hPTP1E, X-ray crystallography shows that the binding of the RA-GEF2 (Ras-association Guanine nucleotide Exchange Factor 2) terminal peptide induces a backbonal change smaller than 1 Å,45 a magnitude comparable to thermal fluctuations. This result is also supported by NMR chemical shift measurements.39 Figure 1 compares the conformational ensembles of PDZ2 in its GEF2-bound and GEF2-free states obtained using all atom molecular dynamics (MD) simulations (see the section titled Methods). Despite the several biophysical and mutational studies, including those mentioned above, the mechanism of PDZ2 stimulation by its natural regulator still remains largely unknown. The study we undertake here sheds new mechanistic light into PDZ2 regulation.

FIG. 1.

Superimposed conformational ensembles of the PDZ2 domain in the GEF2-free and GEF2-bound states. Each of the two conformational ensembles is represented by 11 snapshots taken at regular intervals from their respective molecular dynamics trajectories (see the section titled Methods). For the sake of clarity, the GEF2 peptide is not shown.

FIG. 1.

Superimposed conformational ensembles of the PDZ2 domain in the GEF2-free and GEF2-bound states. Each of the two conformational ensembles is represented by 11 snapshots taken at regular intervals from their respective molecular dynamics trajectories (see the section titled Methods). For the sake of clarity, the GEF2 peptide is not shown.

Close modal

We use molecular dynamics (MD) for generating conformational ensembles of PDZ2 in its GEF2-free and GEF2-bound states. The starting coordinates are taken from crystallographic structures deposited in the Protein Databank (PDB IDs: 3LNX and 3LNY).45 Hydrogens are added and their positions are optimized using PDB2PQR.46 The N- and C-termini of the GEF2 peptide and the protein are capped by adding acetyl (ACE) and N-methylamide (NME), respectively. The GEF2-free and the GEF2-bound structures are placed in cubic boxes containing ∼11 K water molecules, including those resolved crystallographically. KCl concentration is set at 75 mM, and there are extra K+ ions compared to Cl ions to compensate for the charge on the GEF2 peptide. MD simulations are carried out in duplicates (different random seeds for velocities) for both the GEF2-free and GEF2-bound states of PDZ2, and each MD simulation is 0.5 μs long. Backbone root mean square deviations (RMSDs) and potential energies are tracked separately for all simulations, and since RMSDs take 0.25 μs to level off for one trajectory, we discard the first 0.25 μs of each trajectory and use only the final 0.25 μs of each trajectory for analysis. Additional details on assessing convergence and construction of conformational ensembles are discussed in the section titled Results.

All four MD simulations are carried out under isothermal-isobaric boundary conditions, and using Gromacs version 4.5.3.47 Temperature is maintained at 298 K using an extended ensemble approach48,49 and with a coupling constant of 1 ps. An extended ensemble approach is also used for maintaining pressure.50 Pressure is maintained at 1 bar using a coupling constant of 1 ps and a compressibility of 4.5 × 10−5 bar−1. Electrostatic interactions are computed using the particle mesh Ewald scheme51 with a Fourier grid spacing of 0.1 nm, a fourth-order interpolation, and a direct space cutoff of 10 Å. van der Waals interactions are computed explicitly for interatomic distances ≤10 Å. The bonds in proteins and the geometries of water molecules are constrained,52,53 and consequently an integration time step of 2 fs is employed. The protein and ions are described using Amber99sb-ILDN parameters,54 and water molecules are described using SPC/E (Single Point Charge Extended) parameters.55 

If fi and fj represent the distributions of two sites in a protein, then their fluctuation correlations Cij are determined by (1). When the distribution is discrete and the data are vectors, that is, fi = {fi1, fi2, …, fim}, such as that obtained from molecular simulations, then (1) takes the following form:

Cij=1σiσj|(fifi¯).(fjfj¯)|,
(3)

where |…| denotes the absolute value. When applied to proteins in the context of constructing signaling networks, the vectors fik and fjk are generally those of Cα atoms or centers-of-mass (CoMs) of backbone atoms of two amino acids, i and j.19,20,22,24,26,27 Here, we take fik and fjk as the centers of mass of two amino acids i and j so that they also represent, although coarsely, the structural fluctuations in side chains. Including side chain fluctuations is important because GEF2 has been shown to affect the dynamics of PDZ2 side chains more than their respective backbones.39,45

If fi and gi represent two different distributions of the same site i, but under the influence of different external potentials, and if fj and gj represent the corresponding distribution of site j, then inter-site correlations in ensemble shifts Δij are given by (2). Calculation of Δij requires that the ensemble data, f and g, are repartitioned such that conformations corresponding to the overlap region fg are identified separately for each site and then removed from f and g to get the residuals f* and g*, respectively. Below we show that such a high-dimensional repartitioning task can be accomplished using the mathematical framework of support vector machines (SVMs). The development below follows from our SVM-based method to compute quantitative estimates for overlaps between conformational ensembles,34 which we also describe briefly for the sake of completeness.

SVMs are traditionally used for predicting binary classification of data.56–59 A SVM is first trained on a set of instances (x1, x2, …) with known group identities (y1, y2, … = {−1, +1}) and then the trained SVM is used for predicting the group identity of an unclassified instance. A SVM can also be constructed when the instances are 3N-dimensional molecular conformations (r) and belong to two ensembles, f = {r1, r2, …, rm} and g = {rm+1, rm+2, …, r2m}.60 Here we use the mathematical framework of SVMs, not for predicting classification but for repartitioning the pre-classified ensembles f and g to obtain the subsets f* and g*.

The training of the SVM is set up as a Lagrange optimization problem, where the goal is to maximize the linear separation between the two groups in some Hilbert space. Essentially, two hyperplanes,

yi(wrb)=1,
(4)

with yi = ±1 are sought that are constructed from a subset of the instances such that the distance 2/∥w∥ between the hyperplanes is maximized. This distance is maximized by minimizing

L=12w2k=12mαk[yk(wrkb)1]
(5)

with respect to ∥w∥ and b and maximizing it with respect to the Lagrange multipliers αk. Note that the square on ∥w∥ permits quadratic optimization and the 1/2 coefficient is introduced for mathematical convenience. Substituting the conditions

Lw=0    w=2mαkykrk
(6)

and

Lb=0    2mαkyk=0,
(7)

into Eq. (5) rearranges the auxiliary function to

L=2mαk12k,l2mαkαlykylK(rk,rl),
(8)

where K(rk, rl) = rk·rl, which is then maximized under the constraint ∑αkyk = 0, ∀ k. An additional box constraint is introduced, 0 ≤ αkC, in which C serves an upper limit on the magnitudes of the Lagrange multipliers.

Note that in the optimization of (8), the feature used for classifying r is its linear projection on other r. The vectors r, however, are generally not linearly separable in the Euclidean space when they represent molecular conformations. Such issues are dealt with by choosing alternative kernels that are, by themselves, inner products in the transformed Hilbert space,57–59 that is, K(rk,rl)=ϕ(rk),ϕ(rl). The primary advantage of such “kernel-tricks” is that they bypass the need to determine the explicit form of the function ϕ(r) that transforms the data from the original space to the desired Hilbert space to make the data linearly separable.

The optimization of (8) produces Lagrange multipliers that are ultimately plugged into the binary classifier, F(r)=i=12mαkykK(rk,r), to predict the group identities of previously unseen data r. More importantly, we note that the optimization of (8) produces two different sets of Lagrange multipliers, αk=0 and αk>0. The rk whose corresponding αk > 0 essentially define the maximum margin hyperplanes that are sought in (4), and these rk are referred to as support vectors.

We have demonstrated34,35,37 that a Kernel can be chosen such that total number of generated support vectors s, where 2 ≤ s < 2m, can be used as a quantitative estimate of the normalized overlap between the two distributions, that is,

Overlap=fg=s/2m.
(9)

The difference between the two ensembles can then be quantified in terms of a normalized metric η ∈ [0, 1),

ηfg=1fg,
(10)

which takes up a value closer to unity as the difference between ensembles increases. Specifically, we showed that if we choose a Gaussian radial distribution function as the Kernel, that is,

K(rk,rl)=exp(γrkrl2),
(11)

then the parameter γ and the box constraint C, which together define the Hilbert space, can be optimized to yield overlaps with high accuracy. Choosing γ = 0.4 Å−2 and C = 100 results in an error in η that is less than 6% for multi-modal distributions.37 This error can be reduced further by choosing higher upper limits for Lagrange multipliers; however, the subtle gain in accuracy requires a significant compromise in the algorithm’s efficiency.37 

The finding that the ratio s/2m accurately quantifies the overlap between ensembles suggests that the support vectors rk may actually be part of the overlap region (fg) in the Euclidean space. If this were true, then they can be simply removed from f and g, respectively, to obtain f* and g*. Figure 2(a) shows the distribution of support vectors in a test case of two partially overlapping 2D Gaussian distributions. Indeed, we find that the majority of the support vectors are part of the overlap region. However, a fraction of the support vectors does not belong to the overlap region but instead belong to f* and g*. This would imply that the ratio s/2m overestimates the overlap, and consequently the computed η is smaller than the analytical value. This is, in fact, what we noted previously34,35,37—for almost all of our test cases involving various distribution types (unimodal, bimodal, trimodal, and quadrimodal), we found that the computed η are systematically underestimated (<6%) with respect to exact values.

FIG. 2.

(a) Distribution of support vectors in a representative case of two partially overlapping 2D Gaussian distributions, f and g. Each of the two distributions comprise of m = 10 000 data points. The support vectors are indicated as blue dots, and the remaining instances in the two distributions, f* and g*, are colored gray and red, respectively. (b) Percent omission error in 50 random pairs of Gaussian distributions. It is computed as a ratio of the number of incorrectly assigned support vectors in the f* (and g*) region and the total instances that belong to the f* (and g*) region. In other words, omission error = FP/(TP + FP), where FP and TP are abbreviations for false positives and true positives, respectively.

FIG. 2.

(a) Distribution of support vectors in a representative case of two partially overlapping 2D Gaussian distributions, f and g. Each of the two distributions comprise of m = 10 000 data points. The support vectors are indicated as blue dots, and the remaining instances in the two distributions, f* and g*, are colored gray and red, respectively. (b) Percent omission error in 50 random pairs of Gaussian distributions. It is computed as a ratio of the number of incorrectly assigned support vectors in the f* (and g*) region and the total instances that belong to the f* (and g*) region. In other words, omission error = FP/(TP + FP), where FP and TP are abbreviations for false positives and true positives, respectively.

Close modal

Now if f* and g* were constructed by simply removing the support vectors from f and g, then f* and g* would, at worst, suffer from partial omissions of instances. More importantly, f* and g* will not be contaminated by instances belonging to fg. Figure 2(b) shows the average omission error in 50 random pairs of Gaussian distributions, one of which is shown in Fig. 2(a). We find that as the ensemble size (m) increases, the omission error reduces and for m ≥ 10 000, the average omission error is below 4%, and the worst case error is also below 7%, which are similar to errors we reported earlier37 in the estimation of η.

The support vectors can, therefore, be used to construct f* and g*, and without need for fitting the underlying distributions to assumed mathematical forms. f* and g* can be used to determine inter-site correlations in ensemble shifts. For discrete distributions and when the data are vectors, (2) takes the following form:

Δij=pijσi*σj*|(gi*fi¯).(gj*fj¯)|.
(12)

Note that the averaging is not performed over all conformations in gi* and gj*. Instead it runs only over a subset of protein conformations that are common to both gi* and gj*. Consequently, we introduce pij, which denotes the joint probability of finding protein conformations that are part of both gi* and gj*. Note also that fi and fj represent 3ni and 3nj dimension vectors, where ni and nj are the numbers of atoms in the two amino acids i and j. Consequently, the support vectors that are generated are representative of entire conformations of amino acids. After repartitioning conformational ensembles of amino acids, we then represent fi and fj by their respective CoMs and compute Δij.

Inter-site correlations Cij are combined with each other and also connected with the spatial organization of sites using undirected weighted graphs G(V, E), where V represents the number of nodes and E represents the edges that connect the nodes.19,20,24,26,27 Nodes on graphs represent points on the proteins that serve as receivers and/or transmitters of information in signaling pathways. Since signal transduction generally needs to be understood at the level of amino acids, nodes on graphs typically represent amino acids,19,20,24,26,27 and we define them as CoMs of amino acids. From a physical standpoint, direct signal communication is expected to occur between only those nodes whose conformational spaces are directly influenced by each other.19,24,26,27 To implement this, one typically measures the distance between the CoMs of two nodes, and if that distance is less than a predefined cutoff, which is generally in the range 4–6 Å,19,24,26,27 then the two nodes are connected by a edge. Otherwise, the two nodes remain unconnected. Instead of using cutoffs, we implement a parameter-free approach that uses the same physical logic, but determines connectivity on the basis of overlap between node volumes—if Γi is the volume of the conformational ensemble of node i, then it will be connected to node j only if Γi ∩ Γj > 0. Edge weights represent a quantity that tells us how nodes communicate with each other. We define edge weights as the inverse of the inter-site correlations. Therefore, the preferred allosteric path that connects any two nodes will be the shortest weighted path. We use Dijkstra’s algorithm61 implemented in Igraph62 to solve for shortest paths between all V(V − 1)/2 node pairs.

We generate duplicate MD trajectories of the PDZ2 domain in its GEF2-free and GEF2-bound states. Each trajectory is 0.5 μs long, and we use the second halves of these trajectories for analysis and for constructing conformational ensembles.

To determine whether the second halves of these trajectories provide adequate representations of conformational ensembles, we compute residue-wise differences between conformational ensembles constructed from duplicate trajectories. For each residue i in PDZ2, we determine ηi12, which is a quantitative estimate of the difference between their ensembles fi1 and fi2 taken from the duplicate trajectories. Note that η is bounded, that is, η ∈ [0, 1), and takes up a value closer to unity as the difference between ensembles increases. Each of the two duplicate ensembles f1 and f2 contain 5000 snapshots extracted at regular time intervals from their respective trajectories. Note also that prior to extracting the coordinates of a residue, the entire structure of PDZ2 is least-square fitted on to the starting structure, which removes the bias against whole molecule rotation and translation.34–37 

Figure 3 shows the cumulative distribution of residue-wise η computed separately for both the GEF2-free and GEF2-bound state ensembles. We find that 90% of the residues have η values smaller than 0.35, which is equivalent to a mean position difference of less than erf(0.35/2)=0.27 Å,35 showing that the differences between the duplicate trajectories are small. We also note that the η of a few residues, especially in the apo state, are large, but an inspection of residue identities reveals that they belong to the N- and C-termini of the PDZ2 domain. We exclude these terminal residues from further analysis.

FIG. 3.

Cumulative probability distribution of residue η1↔2 between duplicate trajectories.

FIG. 3.

Cumulative probability distribution of residue η1↔2 between duplicate trajectories.

Close modal

Instead of discarding the data from the duplicate trajectories, we combine the ensembles from the duplicate trajectories and create one representative 10 000-conformation ensemble for each of the GEF2-free (f) and GEF2-bound states (g). We then estimate the difference between these ensembles ηifg and compare them to the ηi12 estimated between duplicate trajectories. We find, in general, that ηifg>>ηi12, which shows that the statistical differences between duplicate trajectories are smaller than the differences between the GEF2-free and GEF2-bound ensembles. Together, this analysis shows that the latter half of the trajectories provide adequate representations of conformational ensembles of the two states.

The MD protocol reproduces well the methyl deuterium order parameters (Saxis2) obtained from NMR.39,63 Figure 4 compares the Saxis2 values computed from one of the two sets of trajectories to experimental estimates. The results are similar for the duplicate set of trajectories. We compute Saxis2 from MD trajectories by modeling the autocorrelation function64 

A(t)=12(3μ^(0)μ^(t)21)
(13)

as an exponential decay

A(t)=Saxis2+(1Saxis2)et/τ,
(14)

an assumption also used in estimating order parameters from NMR spectral densities. In the expressions above, μ^ are the unit vectors of C–C bonds in which the latter carbon is part of the methyl group, and τ is the relaxation time. We assume here that the order parameters of the C–C(H3) bonds represent those of the hydrogens in the CH3 groups.

FIG. 4.

(a) Methyl deuterium order parameter (Saxis2) computed from the final 250 ns of MD compare well with those estimated from NMR.63ρ denotes the Pearson correlation coefficient between the computed and experimental Saxis2 values. (b) Distribution of statistical error in estimating (Saxis2) from MD, determined from block averaging, which was carried out by progressively increasing sampling size.36,65 Note that for almost all cases the block averaging error <0.05, indicating that the Saxis2 values are statistically converged.

FIG. 4.

(a) Methyl deuterium order parameter (Saxis2) computed from the final 250 ns of MD compare well with those estimated from NMR.63ρ denotes the Pearson correlation coefficient between the computed and experimental Saxis2 values. (b) Distribution of statistical error in estimating (Saxis2) from MD, determined from block averaging, which was carried out by progressively increasing sampling size.36,65 Note that for almost all cases the block averaging error <0.05, indicating that the Saxis2 values are statistically converged.

Close modal

Figure 5 compares the GEF2-induced shifts in residue centers-of-mass (CoMs) and root mean square fluctuations (RMSFs). In general, we note that GEF2 affects the structure and dynamics of residue side chains more than their respective backbones, a result consistent with previous studies.39,45 Such a form of induced changes has contributed to the prevailing hypothesis39 that allosteric regulation in PDZ2 occurs primarily through changes in side chain structure and dynamics. However, since there is no formal relationship between the extent of induced shifts and signaling propensity, contributions from residues undergoing relatively small changes in backbone structure and dynamics should not be ignored.

FIG. 5.

GEF2-induced shifts (Å) in residue (a) centers-of-mass (ΔCoMs) and (b) root mean square fluctuations (ΔRMSFs). GEF2 affects the structure and dynamics of residue side chains more than their respective backbones. The inset in (a) compares the conformational ensembles (11 equally spaced representative snapshots) of R79, the residue whose side chain undergoes the highest ΔCoM. The inset in (b) compares the conformational ensembles of S29, the residue that undergoes the highest ΔRMSF.

FIG. 5.

GEF2-induced shifts (Å) in residue (a) centers-of-mass (ΔCoMs) and (b) root mean square fluctuations (ΔRMSFs). GEF2 affects the structure and dynamics of residue side chains more than their respective backbones. The inset in (a) compares the conformational ensembles (11 equally spaced representative snapshots) of R79, the residue whose side chain undergoes the highest ΔCoM. The inset in (b) compares the conformational ensembles of S29, the residue that undergoes the highest ΔRMSF.

Close modal

Using (3), we compute separately inter-site fluctuation correlations in the GEF2-free state (Cijfree) and the GEF2-bound state (Cijbound). These correlations are shown in Fig. 6. A visual inspection of Cij suggests that GEF2 binding does not alter the overall correlation pattern. To analyze this quantitatively, we cluster the Cij matrices using affinity propagation.66 Unlike most other clustering algorithms, affinity propagation is an unsupervised machine learning algorithm that does not assume a priori the number of clusters or a cutoff value for delineating clusters. Instead, the algorithm takes only as input the measures of similarity between pairs of data points, which we define as the inverse of the correlation, that is, 1/Cij. We test for convergence by progressively doubling the optimization steps and checking for constancy. Affinity propagation yields almost identical clusters for the GEF2-free and GEF2-bound states with minor differences limited to the residues in the N- and C-termini of the PDZ2 domain, confirming that GEF2 binding indeed does not alter the overall correlation pattern.

FIG. 6.

Heat maps of inter-site correlations in thermal fluctuations Cijfree and Cijbound in the GEF2-free and GEF2-bound states, respectively. Note that Cii is set to zero and the remaining correlations are normalized by dividing each set by their respective highest values.

FIG. 6.

Heat maps of inter-site correlations in thermal fluctuations Cijfree and Cijbound in the GEF2-free and GEF2-bound states, respectively. Note that Cii is set to zero and the remaining correlations are normalized by dividing each set by their respective highest values.

Close modal

Inter-site signal communication, however, does not solely depend on clustering patterns in Cij. It also depends on connectivity between and within clusters.19–30 To further examine the effect of GEF2 binding the inter-site signaling we use Cij to construct 2D undirected graphs and carry out a flow analysis of the graph.19,24,26,27 The nodes on the graph represent residues, and edge weights in the graph are taken 1/Cij, implying that the optimal connectivity between two nodes will be given by the shortest weighted path connecting the nodes. We connect only those nodes by edges that physically interact with each other. Typically, two residues are connected directly if the distances between the average CoMs of their corresponding residues are less than a predefined cutoff, which is generally around 5 Å.19,24,26,27 Instead of using pre-defined cutoffs, we compute the overlap between the volumes of residue conformational ensembles, and two residues (nodes) are considered to physically interact (connected) if their volume overlap is non-zero.

We construct two such graphs, Gfree using residue ensembles and correlations in the GEF2-free state and Gbound using correlations and residue ensembles in the GEF2-bound state. Note that while the two graphs have the same number of nodes (residues) V = 89, they have different numbers of connected edges (interacting residue pairs)—Gfree has 810 edges and Gbound has 714 edges. On average, each node in Gfree has 9.1 edges and each node in Gbound has 8.0 edges. We then solve for shortest weighted paths between all V(V − 1)/2 node pairs. After solving for shortest paths, we count how many times each node appears in the V(V − 1)/2 shortest paths, Ωi. Note that (V − 1) ≤ ΩiV(V − 1)/2. We do this separately for each of the two graphs, and so for each graph, we obtain a separate set of node-occurrences {Ωi}. As before,19,24,26,27 we assume that a residue that has a higher Ω contributes more to allosteric signaling, and so we define Ω to be a residue’s signaling propensity.

For each graph, we rank separately its residues in decreasing order of their signaling propensity Ω. This yields, for each of the two graphs, an ordered set of signaling propensities. We find that the Pearson correlation between residue ranks in the two graphs is small (ρ = 0.18). The correlation is even smaller if only the top ranked (25%) residues are considered (Fig. 7). Now even if the ordering in the top ranked residues is ignored, we find that the pairwise identity overlaps between the two states are about 50%. This implies that if a residue contributes significantly to signaling in the GEF2-free state, it does not necessarily imply that it will also contribute to signaling in the GEF2-bound state. The residue list in Fig. 7 is also color-coded to highlight NMR order parameter changes (|ΔS2|) effected by GEF2 binding. We note no apparent relationship between the signaling propensity of a residue and the |ΔS2| it undergoes. As such, there is no formal relationship between the extent of induced shifts and signaling propensity, yet several comparisons between the two exist in the literature.

FIG. 7.

List of top ranked (25%) residues in decreasing order of their signaling propensities, Ω. The residues are color-coded to highlight whether their NMR order parameters (S2) change upon GEF2 binding.39,63σ is the experimental standard error.

FIG. 7.

List of top ranked (25%) residues in decreasing order of their signaling propensities, Ω. The residues are color-coded to highlight whether their NMR order parameters (S2) change upon GEF2 binding.39,63σ is the experimental standard error.

Close modal

Taken together with results from unsupervised clustering, we find that GEF2 binding significantly alters the global signaling network in the protein, however, not by altering the spatial arrangement of strongly interacting residue clusters, but by modifying the connectivity between and within clusters.

To enable computation of Δij using (12), we need to repartition the GEF2-free (f) and GEF2-bound (g) conformational ensembles of each amino acid such that conformations corresponding to their overlap region fg are identified and then removed from f and g to get the residuals f* and g*, respectively. We accomplish this task using supervised machine learning as detailed in the section titled Methods. Figure 8 shows the repartitioning results for two representative amino acids, S29 and R79, that undergo, respectively, the highest GEF2-induced changes in CoM and RMSF (see Fig. 5). A visual inspection indicates that most of the support vectors are part of the overlap region. However, there are some support vectors that are not part of the physical overlap region and are fencing the individual f and g ensembles. These conformations will end up being erroneously removed from f and g during the construction of f* and g*, but our analysis in Fig. 2 indicates that this omission error will be small.

FIG. 8.

Repartitioned conformational ensembles (500 equally spaced snapshots) of two representative amino acids. The conformations that are support vectors are drawn in blue.

FIG. 8.

Repartitioned conformational ensembles (500 equally spaced snapshots) of two representative amino acids. The conformations that are support vectors are drawn in blue.

Close modal

We compute Δij in two ways: (a) Δijfreebound, which is the correlation in shift in the GEF2-bound state with respect to the GEF2-free state, and (b) Δijboundfree, which is the correlation in shift in the GEF2-free state with respect to the GEF2-bound state. Δijfreebound and Δijboundfree are expected to be identical only if the conformational densities in the GEF2-free and GEF2-bound states for both residues i and j are symmetric about their interface.

Figure 9 shows these correlations. We note first that Δij are generally weaker than Cij, but at the same time, there do exist distinct residue clusters whose Δij are large. The clustering patterns in Δijfreebound and Δijboundfree, as determined from affinity propagation,66 have some similarities, but are not identical, highlighting that the conformational densities in the GEF2-free and GEF2-bound states are not always symmetric about their interfaces.

FIG. 9.

Heat maps of inter-site correlations in ensemble shifts Δijfreebound and Δijboundfree. Note that Δii is set to zero and the remaining correlations are normalized by dividing each set by their respective highest values.

FIG. 9.

Heat maps of inter-site correlations in ensemble shifts Δijfreebound and Δijboundfree. Note that Δii is set to zero and the remaining correlations are normalized by dividing each set by their respective highest values.

Close modal

To examine the clustering patterns in Δij, we plot the highest Δij of each residue as a function of distance from the GEF2 peptide (10). Note that both correlation matrices Δijfreebound and Δijboundfree have V = 89 pairs of correlations for each residue, and we plot only the highest correlation value after combining the correlations from both matrices. We note that the strength of Δij decreases systematically with distance from the GEF2 binding site. This is in contrast to the non-dependence of the strength of Cij on distance (also shown in Fig. 10).

FIG. 10.

Relationship between the highest Δij of a PDZ residue and its distance (d) from the GEF2 peptide. d is the shortest distance between the PDZ residue and the GEF2 peptide and is measured between Cα atoms. The highest Cij of each PDZ residue are also shown as unfilled orange circles.

FIG. 10.

Relationship between the highest Δij of a PDZ residue and its distance (d) from the GEF2 peptide. d is the shortest distance between the PDZ residue and the GEF2 peptide and is measured between Cα atoms. The highest Cij of each PDZ residue are also shown as unfilled orange circles.

Close modal

Now if we assume that Δij serves as an indicator of the propagation of the regulatory signal, then it leads to the conclusion that the regulatory signal weakens with distance from the regulatory site. As such, and to the best of our knowledge, there are no propositions for indicators of regulatory signal transduction. Note that Cij are indicators of how two sites in a protein communicate with each other in a specific state and are not reflective of how induced-changes in one site propagate to another site.

Here we use the human PTP1E’s PDZ2 domain as a model system to address two outstanding questions in dynamic allostery. First, how do regulators modify inter-site correlations in conformational fluctuations (Cij)? Second, how are regulator-induced shifts in conformational ensembles at two different sites in a protein related to each other? We address these questions by analyzing correlations and causalities in high-dimensional conformational ensembles of the PDZ2 domain by using a combination of supervised and unsupervised machine learning approaches. In particular, we use affinity propagation to cluster Cij matrices and also carry a flow analysis of weighted graphs of Cij, which reveal that regulator binding significantly alters the global signaling network in the protein. More surprisingly, we find that regulator binding alters the global signaling network not by altering the spatial arrangement of strongly interacting residue clusters but by modifying the connectivity between and within clusters. We also develop a new supervised learning method to repartition high-dimensional conformational ensembles, which enables the computation of correlations in regulator-induced ensemble shifts Δij. We find that Δij are generally weaker than Cij, but at the same time, there do exist distinct residue clusters whose Δij are large. More importantly, we find that in contrast to Cij, Δij depends inversely on the distance from the regulator binding site. If we assume that Δij is an indicator of the transduction of the regulatory signal, then it leads us to conclude that the regulatory signal weakens with distance from the regulatory site. We note that this assumption lacks a rigorous theoretical basis, but we also note that there exists no theory, nor any proposition in the literature for how regulatory signals are transmitted across proteins. Nevertheless, this finding motivates a further study of correlation patterns in other proteins regulated by dynamic allostery. Overall, this work provides new approaches to analyze high-dimensional molecular simulation data and also presents applications that yield new insight into dynamic allostery.

We acknowledge the use of the services provided by Research Computing at USF. This research was funded by USF Foundation and NSF Grant No. CHE-1531590.

1.
L.
Yang
,
G.
Song
, and
R. L.
Jernigan
,
Proc. Natl. Acad. Sci. U. S. A.
106
,
12347
12352
(
2009
).
2.
M. D.
Daily
and
J. J.
Gray
,
Proteins: Struct., Funct., Bioinf.
67
,
385
399
(
2007
).
3.
M. D.
Daily
and
J. J.
Gray
,
PLoS Comput. Biol.
5
,
e1000293
(
2009
).
4.
A. V.
Karginov
,
F.
Ding
,
P.
Kota
,
N. V.
Dokholyan
, and
K. M.
Hahn
,
Nat. Biotechnol.
28
,
743
747
(
2010
).
5.
J. A.
Brannigan
and
A. J.
Wilkinson
,
Nat. Rev. Mol. Cell Biol.
3
,
964
970
(
2002
).
6.
V.
Nanda
and
W. F.
DeGrado
,
J. Am. Chem. Soc.
128
,
809
816
(
2006
).
7.
D. J.
Mandell
and
T.
Kortemme
,
Nat. Chem. Biol.
5
,
797
807
(
2009
).
8.
R. J.
Pantazes
,
M. J.
Grisewood
, and
C. D.
Maranas
,
Curr. Opin. Struct. Biol.
21
,
467
472
(
2011
).
9.
B. J.
Grant
,
A. A.
Gorfe
, and
J. A.
McCammon
,
Curr. Opin. Struct. Biol.
20
,
142
147
(
2010
).
10.
A.
Cooper
and
D.
Dryden
,
Eur. Biophys. J.
11
,
103
109
(
1984
).
11.
K.
Henzler-Wildman
and
D.
Kern
,
Nature
450
,
964
972
(
2007
).
12.
Q.
Cui
and
M.
Karplus
,
Protein Sci.
17
,
1295
1307
(
2008
).
13.
I.
Bahar
,
T. R.
Lezon
,
L.-W.
Yang
, and
E.
Eyal
,
Annu. Rev. Biophys.
39
,
23
42
(
2010
).
14.
S.-R.
Tzeng
and
C. G.
Kalodimos
,
Curr. Opin. Struct. Biol.
21
,
62
67
(
2011
).
15.
H. N.
Motlagh
,
J. O.
Wrabl
,
J.
Li
, and
V. J.
Hilser
,
Nature
508
,
331
339
(
2014
).
16.
C.-J.
Tsai
and
R.
Nussinov
,
PLoS Comput. Biol.
10
,
e1003394
(
2014
).
17.
K. H.
DuBay
,
G. R.
Bowman
, and
P. L.
Geissler
,
Acc. Chem. Res.
48
,
1098
1105
(
2015
).
18.
J.
Guo
and
H.-X.
Zhou
,
Chem. Rev.
116
,
6503
6515
(
2016
).
19.
C.
Chennubhotla
and
I.
Bahar
,
PLoS Comput. Biol.
3
,
e172
(
2007
).
20.
K.
Brinda
and
S.
Vishveshwara
,
Biophys. J.
89
,
4159
4170
(
2005
).
21.
R. E.
Amaro
,
A.
Sethi
,
R. S.
Myers
,
V. J.
Davisson
, and
Z. A.
Luthey-Schulten
,
Biochemistry
46
,
2156
2173
(
2007
).
22.
Y.
Kong
and
M.
Karplus
,
Structure
15
,
611
623
(
2007
).
23.
C.-P.
Lin
,
S.-W.
Huang
,
Y.-L.
Lai
,
S.-C.
Yen
,
C.-H.
Shih
,
C.-H.
Lu
,
C.-C.
Huang
, and
J.-K.
Hwang
,
Proteins: Struct., Funct., Bioinf.
72
,
929
935
(
2008
).
24.
T.
Lin
and
G.
Song
, “
Predicting allosteric communication pathways using motion correlation network
,” in
Proceedings of the 7th Asia Pacific Bioinformatics Conference (APBC), Tsinghua University, China
(
2009
), pp.
588
598
.
25.
W.
Stacklies
,
F.
Xia
, and
F.
Gräter
,
PLoS Comput. Biol.
5
,
e1000574
(
2009
).
26.
A.
Ghosh
,
R.
Sakaguchi
,
C.
Liu
,
S.
Vishveshwara
, and
Y.-M.
Hou
,
J. Biol. Chem.
286
,
37721
37731
(
2011
).
27.
A.
del Sol
,
H.
Fujihashi
,
D.
Amoros
, and
R.
Nussinov
,
Mol. Syst. Biol.
2
,
19
(
2006
).
28.
C.
Atilgan
and
A. R.
Atilgan
,
PLoS Comput. Biol.
5
,
e1000544
(
2009
).
29.
C.
Atilgan
,
Z.
Gerek
,
S.
Ozkan
, and
A.
Atilgan
,
Biophys. J.
99
,
933
943
(
2010
).
30.
D. U.
Ferreiro
,
J. A.
Hegler
,
E. A.
Komives
, and
P. G.
Wolynes
,
Proc. Natl. Acad. Sci. U. S. A.
108
,
3499
3503
(
2011
).
31.
K.
Lindorff-Larsen
and
J.
Ferkinghoff-Borg
,
PLoS One
4
,
e4203
(
2009
).
32.
K. C.
Wolfe
and
G. S.
Chirikjian
,
Entropy
14
,
213
232
(
2012
).
33.
C. L.
McClendon
,
L.
Hua
, and
M. P.
Jacobson
,
J. Chem. Theory Comput.
8
,
2115
(
2012
).
34.
R. E.
Leighty
and
S.
Varma
,
J. Chem. Theory Comput.
9
,
868
875
(
2013
).
35.
S.
Varma
,
M.
Botlani
, and
R. E.
Leighty
,
Proteins: Struct., Funct., Bioinf.
82
,
3241
3254
(
2014
).
36.
P.
Dutta
,
M.
Botlani
, and
S.
Varma
,
J. Phys. Chem. B
118
,
14795
14807
(
2014
).
37.
P.
Dutta
,
A.
Siddiqui
,
M.
Botlani
, and
S.
Varma
,
Biophys. J.
111
,
1621
1630
(
2016
).
38.
I.
Bezprozvanny
and
A.
Maximov
,
Proc. Natl. Acad. Sci. U. S. A.
98
,
787
789
(
2001
).
39.
E. J.
Fuentes
,
C. J.
Der
, and
A. L.
Lee
,
J. Mol. Biol.
335
,
1105
1115
(
2004
).
40.
E.
Kim
and
M.
Sheng
,
Nat. Rev. Neurosci.
5
,
771
781
(
2004
).
41.
S.
Gianni
,
T.
Walma
,
A.
Arcovito
,
N.
Calosci
,
A.
Bellelli
,
Å.
Engström
,
C.
Travaglini-Allocatelli
,
M.
Brunori
,
P.
Jemth
, and
G. W.
Vuister
,
Structure
14
,
1801
1809
(
2006
).
42.
C. M.
Petit
,
J.
Zhang
,
P. J.
Sapienza
,
E. J.
Fuentes
, and
A. L.
Lee
,
Proc. Natl. Acad. Sci. U. S. A.
106
,
18249
18254
(
2009
).
43.
A. B.
Law
,
E. J.
Fuentes
, and
A. L.
Lee
,
J. Am. Chem. Soc.
131
,
6322
(
2009
).
44.
H.-J.
Lee
and
J. J.
Zheng
,
Cell Commun. Signaling
8
,
8
(
2010
).
45.
J.
Zhang
,
P. J.
Sapienza
,
H.
Ke
,
A.
Chang
,
S. R.
Hengel
,
H.
Wang
,
G. N.
Phillips
, Jr.
, and
A. L.
Lee
,
Biochemistry
49
,
9280
(
2010
).
46.
T. J.
Dolinsky
,
P.
Czodrowski
,
H.
Li
,
J. E.
Nielsen
,
J. H.
Jensen
,
G.
Klebe
, and
N. A.
Baker
,
Nucleic Acids Res.
35
,
W522
W525
(
2007
).
47.
B.
Hess
,
C.
Kutzner
,
D.
Van Der Spoel
, and
E.
Lindahl
,
J. Chem. Theory Comput.
4
,
435
447
(
2008
).
48.
49.
W. G.
Hoover
,
Phys. Rev. A
31
,
1695
(
1985
).
50.
M.
Parrinello
and
A.
Rahman
,
J. Appl. Phys.
52
,
7182
7190
(
1981
).
51.
T.
Darden
,
D.
York
, and
L.
Pedersen
,
J. Chem. Phys.
98
,
10089
10092
(
1993
).
52.
B.
Hess
,
J. Chem. Theory Comput.
4
,
116
122
(
2008
).
53.
S.
Miyamoto
and
P. A.
Kollman
,
J. Comput. Chem.
13
,
952
962
(
1992
).
54.
K.
Lindorff-Larsen
,
S.
Piana
,
K.
Palmo
,
P.
Maragakis
,
J. L.
Klepeis
,
R. O.
Dror
, and
D. E.
Shaw
,
Proteins: Struct., Funct., Bioinf.
78
,
1950
1958
(
2010
).
55.
H.
Berendsen
,
J.
Grigera
, and
T.
Straatsma
,
J. Phys. Chem.
91
,
6269
6271
(
1987
).
56.
C.
Cortes
and
V.
Vapnik
,
Mach. Learn.
20
,
273
297
(
1995
).
57.
A. J.
Smola
,
B.
Schölkopf
, and
K.-R.
Müller
,
Neural Networks
11
,
637
649
(
1998
).
58.
B.
Schölkopf
,
C. J.
Burges
, and
A. J.
Smola
,
Advances in Kernel Methods: Support Vector Learning
(
The MIT Press
,
1998
),
p. 386
.
59.
N.
Cristianini
and
J.
Shawe-Taylor
,
An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods
(
Cambridge University Press
,
2000
),
p. 204
.
60.
R. N.
Jorissen
and
M. K.
Gilson
,
J. Chem. Inf. Model.
45
,
549
561
(
2005
).
61.
E. W.
Dijkstra
,
Numer. Math.
1
,
269
271
(
1959
).
62.
G.
Csardi
and
T.
Nepusz
,
Int. J. Complex. Syst.
1695
,
1
9
(
2006
).
63.
E. J.
Fuentes
,
S. A.
Gilmore
,
R. V.
Mauldin
, and
A. L.
Lee
,
J. Mol. Biol.
364
,
337
351
(
2006
).
64.
G.
Lipari
and
A.
Szabo
,
J. Am. Chem. Soc.
104
,
4546
4559
(
1982
).
65.
H.
Flyvbjerg
and
H. G.
Petersen
,
J. Chem. Phys.
91
,
461
466
(
1989
).
66.
B. J.
Frey
and
D.
Dueck
,
Science
315
,
972
976
(
2007
).