DCA-MOL

Inputs

To run DCA-MOL, you need three input files: a Direct Information (DI) score file, a PDB file with 3D coordinates and a FASTA formatted alignment file.

DI scores file: The DI scores file is obtained as a result of performing Direct Coupling Analysis (DCA) on a family of evolutionarily related sequences, such as those found in Pfam (cite) or assembled by the user through sequence alignment. DCA is a statistical inference framework that infers direct co-evolutionary couplings among residue pairs in multiple sequence alignments (MSAs). With a FASTA formatted MSA for a protein or RNA family (from Pfam or Rfam), a user can submit a job to the DCA webserver (dca.rice.edu) or run the MATLAB implementation of DCA downloaded from the website. A DI file contains scores for all possible residue-residue pairs in the MSA. For more information about DCA and how to submit jobs, see the DCA paper[1] or this site.

The output DI scores file from the server consists of three columns: the first two columns hold the indices of the coupled residues in the MSA, and the third column contains the DI score for that pair. By default, DCA-MOL assumes a three-column input (Fig. 1). When you run DCA with the MATLAB implementation, the output consists of four columns, with the DI score in the fourth column; a popup will appear asking the user to indicate which column contains the DI scores. The input to DCA-MOL could in principle come from any other method that scores pairwise couplings, as long as the input file has a similar column format, with the first two columns containing position indices and a later column containing the pairwise metric.
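Both column layouts described above can be read with one small parser. The sketch below is only an illustration, not DCA-MOL's own loader: the function name read_di_scores, the whitespace-separated layout, the skipping of lines starting with "#", and the sample values are all assumptions made for the example.

```python
def read_di_scores(lines, score_column=2):
    """Parse DI records from an iterable of text lines.

    score_column is the zero-based index of the column holding the score:
    2 for a three-column file, 3 for a four-column file.
    Returns a list of (i, j, score) tuples.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and comment lines (an assumption about the format).
        if not fields or fields[0].startswith("#"):
            continue
        i, j = int(fields[0]), int(fields[1])
        pairs.append((i, j, float(fields[score_column])))
    return pairs


# Made-up records in the two layouts; only the score column position differs.
three_columns = ["1 2 0.31", "1 3 0.02", "2 3 0.12"]
four_columns = ["1 2 0.45 0.31", "1 3 0.10 0.02", "2 3 0.22 0.12"]

print(read_di_scores(three_columns))
print(read_di_scores(four_columns, score_column=3))
```

Passing the column index explicitly mirrors the choice the popup asks the user to make, and it also covers files produced by other pairwise scoring methods.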

Pro tip: There are some example files given along with the plugin code on GitHub.

PDB 3D coordinates: To visualize selected couplings on a specific structure, 3D coordinates of the biomolecule are required. The user has three options to load the structure in DCA-MOL (Fig. 2):

Type the PDB ID. DCA-MOL will fetch the structure directly from the RCSB PDB database.
Load a local file. You can download the .pdb file from the PDB database and then load it in DCA-MOL. You can modify the structure file if required.
Select a structure already loaded in the PyMOL interface (this option is present only when PyMOL already contains structures). DCA-MOL works on an exact copy of the prepared structure, so any previous modifications (colouring, structure editing) are preserved and the original is not permanently changed.

After loading the structure, you need to select which chain in the structure corresponds to the MSA you analyzed with DCA. If no chain name is provided, the first chain present in the structure file will be selected by default.
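The chain selection rule can be illustrated by reading the sequence of one chain from the ATOM records of a PDB file. This is a minimal sketch for proteins only and is not the plugin's code: the function name chain_sequence, the use of C-alpha atoms to count residues, and the four sample records are assumptions. The column slices follow the fixed-width PDB ATOM record layout.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence(pdb_lines, chain=None):
    """Return (chain_id, one-letter sequence) read from ATOM records.

    If chain is None, the first chain seen in the file is used.
    Each residue is counted once, through its C-alpha atom.
    """
    residues = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]                  # column 22: chain identifier
        if chain is None:
            chain = chain_id                 # default: first chain in the file
        if chain_id != chain or line[12:16].strip() != "CA":
            continue
        # Columns 18-20 hold the residue name; unknown names become "X".
        residues.append(THREE_TO_ONE.get(line[17:20], "X"))
    return chain, "".join(residues)


# Made-up records, formatted in fixed PDB columns.
pdb_lines = [
    "ATOM      1  N   MET A   1      10.000  10.000  10.000",
    "ATOM      2  CA  MET A   1      11.000  10.000  10.000",
    "ATOM      3  CA  LYS A   2      14.800  10.000  10.000",
    "ATOM      4  CA  GLY B   1      30.000  10.000  10.000",
]

print(chain_sequence(pdb_lines))        # no chain given: first chain is used
print(chain_sequence(pdb_lines, "B"))
```

Alternate locations and insertion codes are ignored here, so a real structure file may need extra filtering.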

FASTA formatted alignment file: To map the DI scores to the structure, the user must extract from the family input MSA the sequence that corresponds to the structure of interest. The simplest way to find this sequence in the MSA is to look up the UniProt ID that the PDB database provides for the structure and match the UniProt entry name against the sequence names in the MSA. For example, the PDB (5DN6) chain A sequence has the UniProt ID "A1B8N8"; the name in parentheses, "ATPA_PARDP", is the name used in the Pfam MSA. Once the name is known, the user can easily extract the sequence from the alignment and create the FASTA formatted alignment input file.

Note: in principle the user could have multiple sequences and interactively select the proper one. However, for large MSA files the selection becomes impractical.
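Extracting the one matching sequence from a large MSA can be scripted. The following sketch assumes that the sequence name appears somewhere in the FASTA header line; the function names, the toy alignment, and the "/1-7" style ranges are placeholders for illustration, with only the name ATPA_PARDP taken from the example above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_sequence(lines, name):
    """Return the first record whose header contains `name`
    as a one-record FASTA string, or None if nothing matches."""
    for header, sequence in read_fasta(lines):
        if name in header:
            return ">%s\n%s\n" % (header, sequence)
    return None


# Toy alignment; the second record is split over two lines on purpose.
msa = [
    ">SEQ1_FAKE/1-7",
    "MR-LIGAE",
    ">ATPA_PARDP/1-7",
    "MK-L",
    "VGAE",
    ">SEQ3_FAKE/1-7",
    "MKALV-AE",
]

print(extract_sequence(msa, "ATPA_PARDP"))
```

The returned string keeps the gap characters, so the column positions still match the indices used in the DI scores file.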

Fig. 1 DI file and FASTA formatted alignment file loading screen.

Fig. 2 Specific PDB structure selection screen.

Initialization and Output

The first step in DCA-MOL is to load the alignment file and the DI scores file and to start the analysis by choosing single-state (for visualization of monomeric or interfacial interactions) or multi-state (to analyze multiple protein structures or conformations). Verify the name of the alignment sequence that corresponds to a structure, then choose whether to proceed with protein or RNA/DNA and load the structure.

The DCA-MOL analysis starts by creating a pairwise alignment between the sequence from the FASTA formatted alignment file and the sequence read from the 3D coordinates file. The alignment is presented to the user, who can correct any mismatches they are aware of. After this, the distances between residues in the structure are calculated. Note: This computation may take a couple of seconds.
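The purpose of the pairwise alignment is to translate MSA column indices, which the DI scores use, into positions in the structure. The sketch below shows one way to build such a map with a plain Needleman-Wunsch global alignment. The function names and the scoring values (match 1, mismatch -1, gap -1) are assumptions for this example, and the alignment routine actually used by DCA-MOL may differ.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch alignment of strings a and b.

    Returns the list of (index_in_a, index_in_b) pairs placed in the
    same alignment column, whether the residues match or not.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover aligned positions.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def map_columns_to_structure(msa_row, structure_seq):
    """Map 1-based MSA columns to 0-based positions in structure_seq."""
    columns = [c for c, ch in enumerate(msa_row, start=1) if ch not in "-."]
    ungapped = "".join(ch for ch in msa_row if ch not in "-.").upper()
    return {columns[i]: j for i, j in global_align(ungapped, structure_seq)}


# Toy case: the structure lacks the first residue of the MSA sequence.
print(map_columns_to_structure("MK-LVGA", "KLVGA"))
```

MSA columns that are gaps in the selected sequence, or that align to nothing in the structure, are simply absent from the map, which corresponds to residues without coordinates.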

The main visualization outputs of DCA-MOL are a DI score contact map plot and a 3D structure visualization of the couplings selected in the map. The default contact map is the plain symmetric DI score contact map (Fig. 3, left): all DI pairs that are over the "Lower range" threshold are shown, colored by their scores. Both the contact map plot and the 3D structure plot can later be modified using several features included in DCA-MOL.

Fig. 3 Basic view of a tRNA contact map and its 3D coordinates in DCA-MOL.

Basic options in DCA-MOL

DI contact map

The default contact map in DCA-MOL plots the DI scores for all residue-residue pairs in a given sequence alignment. The user can select the set of pairs displayed in the contact map by their minimum DI value ("Lower range border") and their maximum DI value ("Upper range border"). The number of DI pairs shown can also be restricted using the options "Show top N contacts" and "Show top N% contacts", as illustrated in Fig. 4.
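The three ways of restricting the displayed pairs amount to simple filters on the score column. The sketch below is an illustration only: the function names and sample scores are made up, and taking the percentage over all pairs passed in (rounded down) is an assumption about how "top N%" is counted.

```python
def in_range(pairs, lower, upper):
    """Keep (i, j, score) tuples whose score lies within [lower, upper]."""
    return [p for p in pairs if lower <= p[2] <= upper]


def top_n(pairs, n):
    """Keep the n highest-scoring pairs, best first."""
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:n]


def top_percent(pairs, percent):
    """Keep the best `percent` percent of the pairs, rounded down."""
    return top_n(pairs, int(len(pairs) * percent / 100))


# Made-up DI pairs: (column i, column j, DI score).
pairs = [(1, 5, 0.40), (2, 9, 0.25), (3, 4, 0.10), (6, 8, 0.05), (1, 9, 0.01)]

print(in_range(pairs, 0.05, 0.30))
print(top_n(pairs, 2))
print(top_percent(pairs, 40))
```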

For each contact map, the color of the DI pairs can be chosen from the Colormap drop-down list. The secondary structure of each region is represented with different colors along the y-axis and x-axis (red for alpha helices and blue for beta sheets in proteins). For RNA, only helices are shown; they indicate base-pairing. If some residues don't have coordinates in the structure, a gray 'X' will be shown for them along the y-axis and x-axis.

Below the contact map plot is the Matplotlib navigation toolbar, which provides panning and zooming and reports the current cursor position and the value at that position.

Fig. 4 Alternative contact map plot and structure plot of tRNA.

Interactively selecting interactions on contact maps and 3D structures

Selecting contacts on the DI contact map plot (either by a single click or by clicking and dragging) shows the coevolving pairs in the structure visualization window as sticks connecting the paired residues. The color of each interaction link follows the color of the corresponding dot in the contact map. Holding the "Ctrl" key while selecting allows multiple selections in different regions. Selecting with the right mouse button marks inter-chain contacts (interactions across different chains). Intra-chain selections are shown on the plot as red dashed rectangles, inter-chain selections as green ones. A selection can be toggled between these two states using the "Toggle inter/intra chain selection" button in the single structure plot menu.

In addition to selecting pairs on the contact map, you can also select spatial regions in the PyMOL viewer window (either in the sequence or on the structure) and then click the "Mark structural selections on the plot" button in the left control bar. A corresponding rectangle will be shown on the plot, and the contacts within it will appear on the structure.
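Conceptually, each rectangle on the plot picks out the DI pairs whose two indices fall inside its x and y ranges, and several rectangles combine as a union. The sketch below illustrates that idea under stated assumptions: the function names and sample pairs are hypothetical, and because the contact map is symmetric, a pair is accepted in either orientation.

```python
def in_rectangle(pair, rect):
    """True if the pair (i, j, score) lies inside rect.

    rect is ((x_min, x_max), (y_min, y_max)). The map is symmetric,
    so both (i, j) and (j, i) are tested.
    """
    i, j = pair[0], pair[1]
    (x_min, x_max), (y_min, y_max) = rect
    return ((x_min <= i <= x_max and y_min <= j <= y_max) or
            (x_min <= j <= x_max and y_min <= i <= y_max))


def select(pairs, rectangles):
    """Return the pairs covered by at least one rectangle."""
    return [p for p in pairs if any(in_rectangle(p, r) for r in rectangles)]


# Made-up DI pairs and two selection rectangles.
pairs = [(3, 40, 0.5), (10, 12, 0.3), (25, 60, 0.2)]
first = ((1, 5), (35, 45))
second = ((55, 65), (20, 30))

print(select(pairs, [first]))           # a single selection
print(select(pairs, [first, second]))   # two regions combined
```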

Single structure plot menu

The single structure plot menu can be detached from the window by clicking on the dashed line in the first row of the menu. This menu contains alternative plot modes for a given structure, which include:

"Show list of selected bonds" After some DI contacts have been selected, this option gives more detailed information about the selected residue pairs, such as their positions, residues and distances in the structure.
"Native contacts comparison mode" The default contact map is a symmetric DI score contact map. To compare the DI pairs with the native contacts, choose "Native contacts comparison mode", which changes the lower triangle to the native contact map. The native contact map is computed directly from the experimental 3D coordinates, and you can select different contact map modes for it. There are currently 3 contact map modes for proteins: C-alpha, C-beta and all heavy atoms. Each mode measures the distance between atoms of the given type and places a contact if the distance threshold is satisfied. The user can also set the distance threshold. For RNA, 4 contact map modes are available: C1' carbon, C4' carbon, O5' oxygen and all heavy atoms. You can also adjust the maximum distance of the pairs shown in the native contact map.
"Align plot to structure" This option changes the axes of the plot so that they correspond to the residues in the structure instead of the sequence positions in the alignment.
"Overlay DI on native contacts" To compare the DI pairs with the native pairs, the DI pairs can be overlaid on top of the native contacts. Grey dots indicate native structure pairs, red dots are unmatched DI pairs, and blue dots are pairs present both in the structure and among the DI pairs.
"Recolor by true/false positives" Based on the native contacts calculated from the 3D structure, DI pairs can be recolored as true positives (pink) and false positives (blue) to show how well the DI pairs overlap with the native fold of the molecule. It is also possible to toggle off false positives and show only true positives.

File menu

Using the File menu you can save the currently shown plot in multiple formats, as well as write out the native contacts, DI pairs, displayed bonds and distances. Additionally, default program options can be changed in the Options menu, such as the default color map, the resolution of saved images, and the colors for True Positive maps.

All menus can be detached from the DCA-MOL window by clicking on the dashed line in the first row of the menu.

In addition to the different styles of contact map plots and structural visualization, DCA-MOL also provides different models of analysis. Here we showcase some sample cases:

Intra-chain interaction inside a single protein/RNA.
Inter-chain interaction between protein/RNA interfaces.
Multiple state models and conformational dynamics.

All the input files for the examples below are provided along with the plugin code on GitHub.

Case 1: Intra-chain interaction inside a single protein

Monomeric interactions inside proteins or RNAs can be analyzed with the single state model. The tRNA example shown in Figs. 3 and 4 was obtained using the single state model. The ATP synthase protein [PDB 5DN6, Pfam PF00006] provides another example for the single state model of DCA-MOL. With the contact display set to show the top 200 contacts, DCA-MOL generates a contact map. Native contacts in this example are defined as pairs whose distances are shorter than 8 Å in the 3D structure. The map shows a true positive rate of about 61%. When the DI contacts are marked on the 3D structure, the DI pairs not only capture the alpha helix and beta sheet interactions (Fig. 5) but also some long-distance interactions (Fig. 6), between residues that are far apart in the sequence but very close in the structure. This information could be very useful in protein structure prediction.
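The true positive rate quoted above is the fraction of displayed DI pairs that are also native contacts under the 8 Å cutoff. The sketch below shows that calculation for a single atom per residue (for example C-alpha), plus a filter for pairs that are close in space but far apart in sequence. The function names, the toy coordinates, and the sequence separation of 24 are assumptions for illustration, not values taken from DCA-MOL. It requires Python 3.8 or later for math.dist.

```python
import math


def classify_pairs(di_pairs, coords, threshold=8.0):
    """Split (i, j, score) tuples into true and false positives.

    coords maps a residue index to the (x, y, z) position of one atom.
    Pairs involving a residue without coordinates are skipped.
    """
    true_pos, false_pos = [], []
    for i, j, score in di_pairs:
        if i not in coords or j not in coords:
            continue
        if math.dist(coords[i], coords[j]) < threshold:
            true_pos.append((i, j, score))
        else:
            false_pos.append((i, j, score))
    return true_pos, false_pos


def long_range(pairs, min_separation=24):
    """Keep pairs whose residues are at least min_separation apart in sequence."""
    return [p for p in pairs if abs(p[0] - p[1]) >= min_separation]


# Made-up coordinates (in angstroms) and DI pairs.
coords = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 3: (7.6, 0.0, 0.0),
          30: (5.0, 3.0, 0.0), 31: (40.0, 0.0, 0.0)}
di_pairs = [(1, 2, 0.5), (2, 30, 0.4), (3, 31, 0.3), (1, 30, 0.2)]

true_pos, false_pos = classify_pairs(di_pairs, coords)
rate = len(true_pos) / (len(true_pos) + len(false_pos))
print("true positive rate:", rate)
print("long-range true positives:", long_range(true_pos))
```

An "all heavy atoms" mode would instead take the shortest distance over all atom pairs of the two residues, which this single-atom sketch does not cover.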

Fig. 5 Interaction between alpha helix and beta sheet in ATP synthase (5DN6).

Fig. 6 Long distance interactions in ATP synthase.

Case 2: Inter-chain interaction between protein/RNA interfaces

Interactions between residues in proteins are not limited to a single domain but can also occur between residues of different domains. Studies using DCA showed that coevolutionary information can capture both intra-chain and inter-chain interactions. However, separating these interactions and further studying the function of different types of conformations becomes time-consuming when the analysis has to be switched between several models. DCA-MOL provides an efficient tool that lets the user easily switch between and compare intra-chain and inter-chain interactions.

In this sample case, we use the isocitrate dehydrogenase dimer to showcase this utility (Fig. 7). Isocitrate dehydrogenase is a homodimer containing two chains (A and B) [PDB ID 2iv0 and Pfam ID PF00089.25]. To let DCA-MOL analyse both the monomer and the dimer interactions, the input FASTA formatted alignment differs slightly from the one used in the basic single state study: the alignment file should contain the alignment sequences for both chains. Because isocitrate dehydrogenase is a homodimer, we duplicate the alignment in the input file. After loading the DI score file and the alignment file, we start the analysis by clicking single state. The program will then ask 'for which sequences(s) from the alignment do you want to assign a structure'. The user needs to select both alignment sequences provided in the previous step and proceed as "protein", loading a different chain (A/B) from the same PDB file for each sequence and checking the 'part of an interface' option.

Fig. 7 DI pairs in Isocitrate dehydrogenase monomer showing long distance interactions in the structure.

The default plot shows monomeric interactions. You can then select "Interface model" from the drop-down list in the upper left corner of DCA-MOL. The program will recalculate the native contacts and show the interface plot. You can easily return to the intra-chain study by selecting the other option from the drop-down list. Some highly ranked DI pairs are far apart in the monomer structure (Fig. 7) but very close in the dimer structure (Fig. 8). This change in distance suggests that these interactions may be important for maintaining quaternary structure and function.
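For a homodimer, the same DI pair can be measured within one chain or across the interface, and comparing the two distances shows which reading is plausible. The sketch below is an illustration under stated assumptions: the function name, the toy coordinates, and the choice to take the shorter of the two possible cross-chain distances are not taken from DCA-MOL. It requires Python 3.8 or later for math.dist.

```python
import math


def intra_and_inter(pair, chain_a, chain_b):
    """Return (intra, inter) distances for a residue pair (i, j).

    intra: both residues taken from chain A.
    inter: the shorter of i(A) to j(B) and j(A) to i(B).
    chain_a and chain_b map residue index to (x, y, z).
    """
    i, j = pair
    intra = math.dist(chain_a[i], chain_a[j])
    inter = min(math.dist(chain_a[i], chain_b[j]),
                math.dist(chain_a[j], chain_b[i]))
    return intra, inter


# Made-up coordinates (angstroms) for two residues in each chain.
chain_a = {10: (0.0, 0.0, 0.0), 50: (30.0, 0.0, 0.0)}
chain_b = {10: (34.0, 0.0, 0.0), 50: (4.0, 0.0, 0.0)}

intra, inter = intra_and_inter((10, 50), chain_a, chain_b)
print("intra-chain: %.1f, inter-chain: %.1f" % (intra, inter))
```

In this toy case the pair is 30 Å apart within the monomer but 4 Å apart across the interface, the same pattern described for the pairs in Fig. 7 and Fig. 8.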

The same strategy can also be applied to large complexes by increasing the number of sequences in the alignment file and the number of input structures.

Fig. 8 Same DI pairs across the interface of isocitrate dehydrogenase dimer showing shorter distance in the structure.

Case 3: Multiple state model and conformational dynamics

Some proteins undergo large conformational changes. DCA-MOL can help you study how coevolving pairs play a role in different protein conformations (treated here as multiple states). Here, we use DCA-MOL to illustrate the conformational change of L-leucine binding protein [PDB IDs 1usg and 1usi, Pfam PF13458.5]. L-leucine binding protein has two states, the open state 1usg and the closed state 1usi. Since the protein sequences are identical in the two structures and only the conformations differ, we need just one alignment sequence in the alignment file. After loading the alignment file and the DI scores file, we start the analysis by choosing "multi-state". When specifying the PDB structure, we first load the open state (1usg, chain A) and then click "add more states" to add the closed state (1usi, chain A).

The contact map is very similar to the single state contact map, but in the upper right corner of the contact map plot you will find the option 'change current state', which allows you to switch between the states. The states are listed in the order in which the user loaded the structures. After choosing the native contacts comparison mode, the native contact map (lower triangle) shows a similar pattern for the two states, except for a set of contacts exclusive to the closed state. The predicted DI contact map gives very accurate interaction information for both the open and the closed state. Some DI pairs that are far apart in the open state structure (Fig. 9) would normally be considered false positives. However, when these DI pairs are marked on the closed state, they are very close in the structure (Fig. 10). The comparison between the open and closed states indicates that these DI pairs are essential for the protein's functional conformational change. By integrating structural information from different states, DI pairs can help identify an ensemble of conformations of the protein along with their coevolutionary signals. The multiple states model of DCA-MOL allows you to clearly visualize these interactions during protein dynamics.
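The comparison between states can be expressed as a search for DI pairs that form a contact in some states but not in all of them. The sketch below illustrates this idea only. The function name, the 8 Å cutoff, the residue numbers, and the distances are invented for the example and were not measured from 1usg or 1usi.

```python
def state_specific_pairs(di_pairs, state_distances, threshold=8.0):
    """Find DI pairs in contact in some, but not all, states.

    state_distances maps a state name to a dict of (i, j) -> distance.
    Returns a dict of (i, j) -> list of states where the pair is in contact.
    """
    result = {}
    for i, j, _score in di_pairs:
        in_contact = [state for state, distances in state_distances.items()
                      if distances[(i, j)] < threshold]
        if 0 < len(in_contact) < len(state_distances):
            result[(i, j)] = in_contact
    return result


# Invented DI pairs and per-state distances in angstroms.
di_pairs = [(12, 150, 0.30), (20, 25, 0.25), (40, 300, 0.10)]
state_distances = {
    "open": {(12, 150): 18.2, (20, 25): 5.1, (40, 300): 25.0},
    "closed": {(12, 150): 6.4, (20, 25): 5.0, (40, 300): 24.1},
}

print(state_specific_pairs(di_pairs, state_distances))
```

Pairs in contact in every state and pairs in contact in none are both left out, so only the pairs that would look like false positives in one state but not in another remain.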

Fig. 9 Multiple states of L-leucine-binding protein. Open state 1USG.

Fig. 10 Multiple states of L-leucine-binding protein. Closed state 1USI.