The goal of this protocol is to reveal structural dynamics of one-dimensional diffusion of protein along DNA, using a plant transcription factor WRKY domain protein as an exemplary system. To do this, both atomistic and coarse-grained molecular dynamics simulations along with extensive computational samplings have been implemented.
One-dimensional (1-D) sliding of transcription factor (TF) protein along DNA is essential for facilitated diffusion of the TF to locate target DNA site for genetic regulation. Detecting base-pair (bp) resolution of the TF sliding or stepping on the DNA is still experimentally challenging. We have recently performed all-atom molecular dynamics (MD) simulations capturing spontaneous 1-bp stepping of a small WRKY domain TF protein along DNA. Based on the 10 µs WRKY stepping path obtained from such simulations, the protocol here shows how to conduct more extensive conformational samplings of the TF-DNA systems, by constructing the Markov state model (MSM) for the 1-bp protein stepping, with various numbers of micro- and macro-states tested for the MSM construction. In order to examine processive 1-D diffusional search of the TF protein along DNA with structural basis, the protocol further shows how to conduct coarse-grained (CG) MD simulations to sample long-time scale dynamics of the system. Such CG modeling and simulations are particularly useful to reveal the protein-DNA electrostatic impacts on the processive diffusional motions of the TF protein above tens of microseconds, in comparison with sub-microseconds to microseconds protein stepping motions revealed from the all-atom simulations.
Transcription factors (TF) search for the target DNA to bind and regulate gene transcription and related activities1. Aside from the three-dimensional (3D) diffusion, the facilitated diffusion of TF has been suggested to be essential for target DNA search, in which the proteins can also slide or hop along one-dimensional (1D) DNA, or jump with intersegmental transfer on the DNA2,3,4,5,6,7.
In a recent study, we have conducted tens of microseconds (µs) all-atom equilibrium molecular dynamics (MD) simulations on a plant TF – the WRKY domain protein on the DNA8. A complete 1-bp stepping of WRKY on poly-A DNA within microseconds has been captured. The movements of the protein along the DNA groove and hydrogen bonds (HBs) breaking-reforming dynamics have been observed. While such a trajectory represents one sampled path, an overall protein stepping landscape is still lack of. Here, we show how to expand computational samplings around the initially captured protein stepping path with the constructed Markov state model (MSM), which have been implemented widely for simulating a variety of biomolecular systems involving substantial conformational changes and time-scale separation9,10,11,12,13,14,15,16,17,18,19. The purpose is to reveal the conformational ensemble and meta-stable states of the TF protein diffusion along DNA for one cyclic step.
While the above MD simulation reveals atomic resolution of the protein movements for 1 bp on the DNA, the structural dynamics of long-time processive diffusion of the TF along DNA at the same high-resolution is hardly accessible. Conducting coarse-grained (CG) MD simulations at residue level is however technically approachable. The CG simulation time scale can be effectively extended to tens or hundreds of times longer than the atomic simulations20,21,22,23,24,25,26,27,28,29. Here, we show the CG simulations conducted by implementing the CafeMol software developed by Takada lab30.
In current protocol, we present the atomic simulations of the WRKY domain protein along poly-A DNA and the MSM construction first, which focus on sampling the protein stepping motions for only 1 bp along DNA. Then we present the CG modeling and simulations of the same protein-DNA system, which extend the computational sampling to the protein processive diffusion over tens of bps along DNA.
Here, we use GROMACS31,32,33 software to conduct MD simulations and MSMbuilder34 to construct the MSM for sampled conformational snapshots, as well as to use VMD35 to visualize the biomolecules. The protocol requires that the user to be able to install and implement the software above. The installation and implementation of the CafeMol30 software is then necessary for conducting the CG MD simulations. Further analyses of the trajectories and visualization are also conducted in VMD.
1. Construction of the Markov state model (MSM) from atomic MD simulations
2. Conducting coarse-grained (CG) simulation to sample long-time dynamics
Rotation-coupled sliding or 1 bp stepping of WRKY from the MSM construction
All protein conformations on the DNA are mapped to the longitudinal movement X and rotation angle of the protein COM along DNA (see Figure 3A). The linear coupling of these two degrees indicates rotation-coupled stepping of the WRKY domain protein on the DNA. The conformations can be further clustered into 3 macrostates (S1, S2, and S3) in the MSM. The forward stepping of WRKY then follows the macrostate transition S1->S2->S3. S1 refers to a metastable state initiated by the modeled structure (based on the crystal structure of WRKY-DNA complex40), with a population of ~ 6%. Note that in current modeling, the initial protein conformation was adopted from the crystal structure in which the protein binds with specific W-box DNA sequence40. Such a modeled protein-poly A-DNA complex thus leads to less favorable initial structures (S1) than the stepped or finally relaxed structures (S3). Nevertheless, one can find that the hydrogen bonds (HBs) at the protein-DNA interface recover near the center of S3 as that near the center in S1 (see Figure 3B). The HBs in the S1 state are well maintained: K125 with A15, R131, Q146 and Y133 with A16, K144 and Y119 with A17, R135 with A18 (Figure 3B top left). S3 refers to a metastable state after the 1-bp protein stepping, with almost all the HBs shifted for 1-bp distance (Figure 3B bottom), and the structures appear stable with the highest population (63%). The intermediate state S2 connects S1 and S3, with a medium-high population (~30%). We found that the R135 and K144 are quite flexible in this intermediate state and can usually break HBs with the current nucleotide and reform that with the next nucleotide (Figure 3B top right). Overall, the WRKY protein COM moved ~2.9 Å and rotated ~55° to stepping 1 bp here. The rate-limiting step for the WRKY stepping is S2->S3, which essentially allows collective breaking and reforming of the HBs and requires ~7 µs on average. In contrast, S1 to S2 can transit very fast at a time of ~0.06 µs or 60-ns (Figure 3B), involving mainly the protein COM fluctuations (e.g., due to protein orientational changes on the DNA).
Single-strand bias of WRKY during processive diffusion in the CG model
In our recent study, we found that the WRKY domain protein binds preferentially to one strand of the dsDNA, no matter during 1-bp stepping or static binding; and the single-strand bias becomes highly prominant particularly upon specific DNA sequence binding8. Meanwhile, it is not clear whether such a trend remains during the processive diffusion of the protein along DNA. Here we tried to examine the potential strand bias via the CG simulations. Interestingly, a significant single-strand DNA binding configuration has been identified in the CG simulations of the WRKY during processive diffusion. To see that, the contact numbers between protein and DNA were calculated on the respective DNA strands (see Figure 4B). A contact is considered when the distance between protein CG particle and DNA CG P (phosphate group) particle is smaller than 7 Å. The protein indeed shows bias to one of the DNA strands (e.g., ~4 contacts to one strand and ~1 contact to the other), i.e., even when detailed interactions such as HBs at the protein-DNA interface are not modeled.
The preferred DNA strand, however, can switch from time to time between the two strands of the DNA, depending on the binding orientation or configuration of the protein on the DNA. In particular, according to the contact number formed between the protein and respective strands of DNA, there are mainly 4 states here (as labeled 1, 2, 3, and 4 in Figure 4B,C). In state 1 and 3, a zinc-finger region binds toward -Y direction, and the preferred strand is the blue one. In state 2 and 3, the zinc-finger region binds toward +Y direction, and the preferred strand becomes the red one. It is also found that the zinc-figner region interacts dominantly with the DNA (see Figure 4D). Hence, the DNA strand bound closely with the zinc-finger region is indeed the preferred one. According to the above sampling, it thus appears that the strand bias persists but switches between the two DNA strands in the CG model of the processive protein diffusion.
Protein individual residual stepping in the CG simulations
It was previously noticed from our CG simulations that the stepping size of WRKY may vary on different DNA sequences8. The protein COM tends to step 1 bp on the homogeneous poly-A DNA. While on poly-AT DNA with 2 bp periodicity, the proportion of 2-bp stepping seems to increase.
Additionally, here we examined whether individual protein residues move synchronously at the protein-DNA interface. We calculated the stepping size of each highly conserved residue in the WRKY motif (WRKYGQK) for every 1000 timesteps (Figure 5A). The residual stepping size of each conserved residue can thus be measured from the CG simulations. The results indeed show that the stepping sizes of these individual residues are more synchronized on poly-A DNA than on poly-AT or random DNA sequences (Figure 5B).
Figure 1: The conformations generation and microstates/macrostates construction. (A) The initial forward stepping path mapped on the protein-DNA RMSD and protein rotational angle around the DNA. The initial chosen 25 structures are labeled by red circles. (B) The 100 conformation cluster centers from the 1st round 25 x 50 ns MD simulation trajectories mapped on the two highest eigenvalue tICs direction. (C) Plots of the implied timescale as a function of lag-time for the MSM construction via tICA using chosen distance pairs as input. For each set, MSM was constructed by projecting the conformations onto the top 2 tICs followed by K-centers clustering to produce 20 to 2000 microstates (from left to right column) with correlation delay time for tICA chosen from 5 to 40 ns (from top to bottom row). (D) The 500 microstates constructed and (E) the further constructed 3 macrostates, with corresponding microstate centers mapped along the highest two tICs direction. Please click here to view a larger version of this figure.
Figure 2: Construction of the macrostates. (A) The mapping of initial forward stepping path trajectory (left) and with a small number of additional micro-second trajectory samplings (right) on the protein center of mass (COM) movement along DNA long axis (X) and rotational angle around the DNA (obtained previously8). (B) The mapping of the original 100 × 50 ns trajectories and the 97 × 50 ns trajectories used in current MSM construction. (C) The construction of 3-6 macrostates and their populations from the constructed MSM are labeled on the extensive sampling maps. (D) The protein movement X and rotation angle around DNA are shown, respectively. The sampled conformations are finally lumped into 3 macrostates, with red, blue, and gray corresponding to the macrostate 1, 2, and 3, respectively. Please click here to view a larger version of this figure.
Figure 3: The MSM of the WRKY domain protein stepping on poly-A DNA. (A) The projection of the MD conformational snapshots onto coordinates of the protein COM movement X and rotational angle with respect to the DNA. The 3 macrostates S1, S2, and S3 are colored in red, blue, and gray, respectively. (B) Representative conformations and transition mean-first-passage-time (MFPT) of the constructed 3 macrostates. The key hydrogen bonds between protein and DNA are shown. Please click here to view a larger version of this figure.
Figure 4: The coarse-grain (CG) model and contacts formed between protein and DNA strands in the CG model. (A) The coarse-graining of protein (left) and DNA (right). (B) The contact number between WRKY and each DNA strand along the simulation. (C) The molecular views of the 4 contact modes. The protein region near the zinc-finger is colored in gray, and the other region is colored in green. (D) The contact probability of each protein amino acid with DNA. When the distance between the CG particle of the amino acid and any DNA CG particles is smaller than 7 Å, the amino acid is considered to be in contact with DNA. Please click here to view a larger version of this figure.
Figure 5: The diffusion step sizes of individual protein amino acid in the WRKY motif as WRKY moving along DNA. (A) The highly conserved residues (WRKYGQK) in atomic structure (left) and after coarse-graining (right). (B) The stepping size for each conserved residue on different sequences of DNA (poly-A; poly-AT; random sequences) Please click here to view a larger version of this figure.
Supplementary File 1: The python codes and software used in this protocol. MSM is built mainly by using the MSMbuilder, the necessary python codes are attached. Please click here to download this File.
Supplementary File 2: The atomistic molecular dynamics simulations are conducted by GROMACS, the commands and necessary files to build all-atom simulations are also attached. The coarse-grained simulations are conducted by CafeMol software. The simulation results are analyzed by VMD and MATLAB. Please click here to download this File.
Supplementary File 3: The tcl script to rotate and move protein in VMD. Please click here to download this File.
This work addresses how to conduct structure-based computational simulation and samplings to reveal a transcription factor or TF protein moving along DNA, not only at atomic detail of stepping, but also in the processive diffusion, which is essential for the facilitated diffusion of TF in the DNA target search. To do that, the Markov state model or MSM of a small TF domain protein WRKY stepping for 1-bp along homogeneous poly-A DNA was first constructed, so that an ensemble of protein conformations on the DNA along with collective hydrogen bonding or HB dynamics at the protein-DNA interface can be revealed. To obtain the MSM, we conducted two rounds of extensive all-atom MD simulations along a spontaneous protein stepping path (obtained from previous 10-μs simulation), with current samplings in aggregation of 7.5 μs (125 x 60 ns). Such extensive samplings provide us with snapshots for conformation clustering into hundreds of microstates, utilizing protein-DNA interfacial pair distances as geometric measures for the clustering. The Markovian property of the MSM construction is partially validated via detecting time-scale separation from the implied time scales calculated for various lengths or lag-time of individual MD simulations. 20–2000 microstates were then tested and compared for the time-scale separation properties, with 500 microstates selected for the MSM construction. Further, the 500 microstates were kinetically lumped into a small number of macrostates, for which we tested various number of states and found that three macrostates sufficient for the current system. The three-state model simply shows that state S1 transits to S2 comparatively fast (within tens of ns), dominated by protein center of mass (COM) fluctuations on the DNA, while state S2 transits to S3 slowly and is rate-limiting (~7 μs on average), dominated by collective HB dynamics for stepping. Note that kinetic lumping of the microstates into a small number of kinetically distinct macrostates is still subject to methodological developments, with different algorithms tested and machine learning techniques for improvements57,58,59,60,61,62,63. The critical steps to build MSM include choosing the distance pairs used in tICA and determining the parameters used to construct microstates. The choice of distance pairs is knowledge based, and it is important to choose the most essential interaction pairs. The parameters for constructing microstates, such as the correlation delay time, lag time, the muber of microstates, need to be properly set to ensure the system to be Markovian.
With such efforts, the submicro- to micro-seconds protein structural dynamics with atomic details can be systematically revealed for protein stepping 1-bp along DNA. In principle, with the transition probability matrix obtained from the MSM construction, the system can be evolved to a long time scale beyond microseconds, or say, to approach milliseconds and above13,17,64. However, there are intrinsic limitations of the MSM sampling and construction, which rely on sub-microseconds individual simulations around a certain initial path, and the Markovian property may not be well guaranteed 65,66. In most practices, the initial path was constructed under forcing or acceleration, though in the current system we take advantage of a spontaneous protein stepping path (without forcing or acceleration) obtained from a 10-ms equilibrium simulation8. The conformational samplings in aggregate are still limited by tens of microseconds due to high computational cost of the atomic simulations. Such microseconds samplings of the protein stepping are unlikely to provide sufficient conformations to appear on long-time scale processive TF diffusion. The memory issue would become significant if one implements the currently obtained transition probability matrix beyond a certain time scale, and the Markovian property is not guaranteed to ensure proper use of current MSM14,52,66. Therefore, to sample the long-time scale processive diffusion of TF along DNA, the residue level coarse-grained or CG modeling and simulation are implemented instead, to balance between maintaining the structural basis and lowering the computational cost.
In the CG modeling and simulation, the protein residues and DNA nucleotides are represented by beads (i.e., one bead for one amino acid, and three beads for one nucleotide), with the protein conformation maintained via the Go model toward a native or pre-equilibrated configuration30,53. Though the atomic level of HB interactions becomes absent in the CG model, the protein-DNA electrostatic interactions are well maintained, which seem to be able to capture dominant dynamics features in the processive diffusion of the protein along DNA67,68,69,70. Detailed implementation protocols are presented for modeling and simulating the WRKY-DNA system here. The representative results show interestingly that first, the single-strand DNA bias presented in the previous atomic simulation of the WRKY-DNA system persists in the CG model, while a variety of protein orientations/configurations sampled during processive diffusion lead to switch of the bias between the two strands from time to time. Hence, such a DNA strand bias does not necessarily link to HB association but seems to rely mainly on the protein-DNA electrostatic interactions, which vary for various protein configurations or orientations on the DNA. Next, individual amino acids at or near the protein-DNA interface, such as the highly conserved WRKQGQK motifs, show different stepping sizes or synchronization patterns for different DNA sequences. In our previous study, the stepping size variations were shown only for the COM of protein, as the protein was modeled to diffuse along different DNA sequences. Note that the current CG model of the DNA supports DNA sequence variations with different parameterization54,71,72, though atomic detail is missing. Proper DNA sequence-dependent parameterization in the structure-based modeling of the protein-DNA system, is thus critical to reveal protein-DNA search and recognition mechanisms across multiple time and length scales.
The authors have nothing to disclose.
This work has been supported by NSFC Grant #11775016 and #11635002. JY has been supported by the CMCF of UCI via NSF DMS 1763272 and the Simons Foundation grant #594598 and start-up fund from UCI. LTD has been supported by Natural Science Foundation of Shanghai #20ZR1425400 and #21JC1403100. We also acknowledge the computational support from the Beijing Computational Science Research Center (CSRC).
CafeMol | Kyoto University | coarse-grained (CG) simulations | |
GROMACS | University of Groningen Royal Institute of Technology Uppsala University | molecular dynamics simulations software | |
Matlab | MathWorks | Numerical calculation software | |
MSMbuilder | Stanford University | build MSM | |
VMD | UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN | molecular visualization program |