Which type of mutation is most likely to render a protein nonfunctional?
In this review we describe a protocol to annotate the effects of missense mutations on proteins, their functions, stability, and binding. For this purpose we present a collection of the most comprehensive databases that store different types of sequencing data on missense mutations, and we discuss their relationships, possible intersections, and unique features. Next, we suggest an annotation workflow using state-of-the-art methods and highlight their usability, advantages, and limitations for different cases. Finally, we address the particularly difficult problem of deciphering the molecular mechanisms by which mutations act on proteins and protein complexes, in order to understand the origins and mechanisms of diseases.
Keywords: Protein–protein interactions, Databases, Mutations
1 Introduction
The era of genome sequencing has revealed a large number of human genetic variations, as illustrated by the milestone 1000 Genomes project [1]. Mutations and genetic recombinations may occur naturally during cell division and may also be caused by extrinsic factors. A single nucleotide substitution that results in a codon change encoding a different amino acid is called a “missense point mutation” (called “mutation” hereafter). Germline missense mutations are passed to progeny, whereas somatic mutations are not inherited. Because many genetic variations have predominantly neutral effects, they have accumulated in the human population; they can be responsible for many individual phenotypic traits and may be used for genetic fingerprinting. Generally, a variant that occurs frequently in a population is termed a polymorphism, and single nucleotide polymorphisms (SNPs) are one of the most common types of genetic variation. Missense mutations can render proteins nonfunctional and may be responsible for many diseases. From the clinical perspective, these non-neutral mutations affecting human health are of primary interest. For some diseases and genes, particularly those following Mendelian inheritance patterns, the causal genotype–phenotype relationship has already been established, while for complex polygenic diseases involving multiple factors it is still unknown. Moreover, genetic variants with low penetrance, weakly associated with disease phenotypes, can only be annotated for large samples, and for many diseases the genetic determinants have yet to be discovered.
2 Materials
2.1 Software
2.2 Online Resources
3 Methods
Recent advances in experimental methods have reduced the cost of DNA sequencing and equipped many labs and hospitals with genotyping and sequencing assays, so that these data, along with the clinical profiles of patients, can be deposited into central archival facilities. These include the database of Genotypes and Phenotypes (dbGaP) [7] and the Genetic Testing Registry (GTR) [8], both developed at NCBI, NIH. The genotypes are mapped to a reference genome assembly and typically include two major categories of variations: (1) single nucleotide polymorphisms (SNPs), deposited in the dbSNP database [9], and (2) larger structural genetic variations in genomes, including long indels, inversions, and copy number variants (CNVs), deposited in the dbVar database [10]. The phenotypes in these databases are mainly organized by disease names or clinical conditions, and diseases are classified according to the Disease Ontology (DO) [11] and Medical Subject Headings (MeSH) terms. Additionally, the NCBI Phenotype–Genotype Integrator (PheGenI) [12] merges genetic variations identified by genome-wide association studies (GWAS) with the dbVar, OMIM, GTR, and dbSNP databases. The clinical implications of genetic variations are recorded in several other databases. The Human Gene Mutation Database (HGMD) [13] and Online Mendelian Inheritance in Man (OMIM) [14] are the main databases that integrate information on Mendelian genetic disorders with the genes and mutations reported to be causative. ClinVar [15] is another archive, which collects data on genetic variations from dbSNP and dbVar and integrates them with the available clinical evidence for these variants obtained from multiple studies. In ClinVar each variant is assigned a score that shows the consistency of the reported clinical effect across different studies. ClinVar annotates variants as benign, likely benign, likely pathogenic, pathogenic, of uncertain significance, or variants with conflicting interpretations.
Importantly, it also annotates the variants associated with individual drug effects. Germline mutations constitute the bulk of mutations in OMIM, ClinVar, and HGMD, while somatic missense mutations are found predominantly in cancer cells and are not inherited. Information on the location, tissue type, and frequency of somatic mutations in tumor samples, together with the cancer type, can be obtained from the COSMIC [16] and TCGA [17] databases, while other resources, for instance MutDB [18] and SwissVar [19], integrate the data on germline and somatic mutations for specific genes and diseases. Figure 1 describes a computational pipeline for exploring mutations and assessing their effects on protein structure, function, and interactions. The sections of this chapter follow this pipeline and suggest protocols to solve each specific problem.
Fig. 1 Mutation assessment workflow
3.1 Collecting and Integrating the Data on Human Polymorphisms and Mutations
First, we describe a protocol for extracting clinically relevant mutations from the ClinVar, COSMIC, and TCGA databases in order to analyze their clinical and functional impacts. Here we use the human monomeric Casitas B-lineage lymphoma c-CBL protein (CBL) as an example [20]. The links and references to web resources used in this protocol are listed in Table 1.
3.2 Finding Protein Domains and Functional Sites Affected by Mutations
Now we consider the domain context of the CBL mutations described in the previous section (p.Y371H and p.C384R) and annotate their impacts on protein interactions and signaling pathways. The web resources used in this protocol are listed in Table 2.
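The mutation labels used above follow the compact one-letter protein notation: p.Y371H denotes tyrosine at position 371 replaced by histidine. As a minimal sketch, such labels can be split into wild-type residue, position, and substituted residue before they are mapped onto domains. The helper names `parse_missense` and `in_region` are hypothetical and belong to none of the resources cited here, and the region boundaries in the usage lines are placeholders, not real CBL domain coordinates.

```python
import re
from typing import NamedTuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


class Missense(NamedTuple):
    wild_type: str
    position: int
    mutant: str


_LABEL = re.compile(rf"^p\.([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


def parse_missense(label: str) -> Missense:
    """Split a label such as 'p.Y371H' into (wild type, position, mutant)."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a one-letter missense label: {label!r}")
    wild_type, position, mutant = match.groups()
    if wild_type == mutant:
        raise ValueError(f"no amino acid change in {label!r}")
    return Missense(wild_type, int(position), mutant)


def in_region(mutation: Missense, start: int, end: int) -> bool:
    """True if the mutated position lies within [start, end] (1-based, inclusive)."""
    return start <= mutation.position <= end


# Hypothetical boundaries for illustration only (NOT real CBL domain coordinates).
PLACEHOLDER_REGION = (350, 400)

for label in ("p.Y371H", "p.C384R"):
    mutation = parse_missense(label)
    print(label, mutation, in_region(mutation, *PLACEHOLDER_REGION))
```

In practice the region boundaries would come from a domain annotation resource such as those listed in Table 2, not from hard-coded values.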
In addition to the NCBI IBIS server, which allows analysis of the domain context and structural determinants of interactions, several other servers (DMDM, PolyDoms, and muPIT) map mutations onto protein domains and protein structures.
3.3 Assessing If Mutations Have Damaging or Benign Effects on Proteins
Many methods have been proposed to predict the effects of missense mutations on proteins, classifying them as damaging or benign. These methods differ in the properties of mutations or proteins used during training as well as in the algorithms applied for decision-making. For example, machine learning algorithms train their models to distinguish known disease-associated mutations from neutral ones. Other methods do not explicitly train their models, but almost all methods described in this section exploit evolutionary conservation, assuming that changes at conserved positions tend to be more deleterious. Besides sequence conservation, various other sequence and structural features are used, which may include changes in physicochemical properties between the wild-type and substituted amino acid, structural features (mostly solvent accessibility), site mutability in DNA, and the sequence context of the site. Unbiased testing and comparison of machine learning methods is difficult because they are usually trained on all available datasets of mutations, so a test set that does not overlap with the training set is hard to obtain. There are several experimental studies on variants in the P53, LacI, and ABCA1 proteins which can be regarded as unbiased test cases. Comparisons of different methods on these experimental sets reported up to 70% TPR (true positive rate) at 10% FPR (false positive rate) [30]. Models trained to distinguish Mendelian variants with pronounced deleterious effects are more appropriate and accurate for predicting the effects of Mendelian mutations.
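To make the "TPR at 10% FPR" figure concrete, the following sketch shows one way such a number can be computed from predictor scores and known labels. It assumes that a higher score means "more damaging"; the function name `tpr_at_fpr` and the toy scores are invented for illustration and do not come from any of the cited benchmarks.

```python
from typing import Sequence


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[bool],
               max_fpr: float = 0.10) -> float:
    """Best true positive rate reachable while keeping FPR <= max_fpr.

    labels[i] is True for a known damaging variant, False for a neutral one.
    A variant is called damaging when its score is >= the chosen threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one damaging and one neutral variant")

    best_tpr = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        if fp / negatives <= max_fpr:
            best_tpr = max(best_tpr, tp / positives)
    return best_tpr


# Toy data (invented): 10 damaging variants followed by 10 neutral ones.
toy_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.40, 0.30, 0.20,
              0.72, 0.50, 0.45, 0.35, 0.25, 0.15, 0.10, 0.08, 0.05, 0.02]
toy_labels = [True] * 10 + [False] * 10

print(tpr_at_fpr(toy_scores, toy_labels))  # 0.7 on this toy data
```

Fixing the false positive rate before comparing true positive rates puts predictors with differently scaled scores on a common footing.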
The accuracy of models trained on Mendelian variants is much higher than that of models that aim to assess the effects of mutations from complex diseases, including cancer. This is evident from the evaluation of PolyPhen-2, which yields 0.70–0.77 TPR at 10% FPR when trained on the HumDiv dataset (Mendelian disease mutations) but drops to 0.50–0.52 TPR for models trained on HumVar (all human disease-causing mutations) [31]. It should be mentioned that several methods are trained to distinguish cancer mutations from neutral polymorphisms (Table 3); however, no existing method can accurately separate driver from passenger mutations within the pool of cancer mutations. One of the most comprehensive comparisons of different methods to predict the phenotypic effects of mutations was performed recently [32]. To avoid bias in the evaluation of these methods, most of which were trained on all available sets of disease mutations and neutral polymorphisms, the authors of this study tested the methods on an independent set of experimental studies. They concluded that the methods varied in accuracy and applicability, with SNPs&GO and MutPred being the most reliable predictors. Interestingly, although different methods use similar sets of features, only half of their correctly predicted cases overlap [32]. Since this study was published, several new methods have been introduced (see Table 3). For example, in contrast to many methods that assess the amino acid frequency distribution at a given site of interest, the recently developed method PROVEAN accounts for the sequence context around the site of interest, and poorly aligned regions or sites are assigned very low scores. Overall, the effect of alignment quality on the performance of all methods is largely undetermined but suspected to be very large.
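The dependence on the alignment can be illustrated with the simplest conservation measure, the Shannon entropy of an alignment column. The sketch below is a generic illustration and not the scoring scheme of PROVEAN or any other method named above; the function name `column_conservation` and the toy alignment are invented.

```python
import math
from collections import Counter
from typing import Sequence

GAP_CHARS = "-."


def column_conservation(alignment: Sequence[str], column: int) -> float:
    """Return 1 - normalized Shannon entropy of one alignment column.

    1.0 means a single residue type occupies the column; lower values mean
    more variability. Gaps are ignored; an all-gap column returns 0.0
    because it carries no evidence of conservation.
    """
    residues = [seq[column].upper() for seq in alignment
                if seq[column] not in GAP_CHARS]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(residues).values())
    return 1.0 - entropy / math.log2(20)  # 20 standard amino acids


# Toy alignment (invented): four homologous fragments of equal length.
toy_alignment = ["MKYC", "MRYC", "MKYS", "M-YA"]

for col in range(len(toy_alignment[0])):
    print(col, round(column_conservation(toy_alignment, col), 3))
```

A misaligned residue or a spurious homolog changes the column counts directly, which is one reason the same conservation-based predictor can give different answers on different alignments.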
Because predictions depend on the underlying alignment, accurate user-constructed alignments of homologous proteins would be very advantageous for accurate annotation of the effects of mutations.
3.4 Predicting the Impact of Mutations on Protein Stability
While methods that classify the damaging effects of mutations are widely used by the genomics community, a new level of annotation is needed to explain why and how these mutations damage proteins. Algorithms and servers described in the next several sections address these tasks. Proteins may evolve through the acquisition of new mutations, most of which are destabilizing but phenotypically neutral. The stability of a protein may be directly related to its functional activity, and incorrect folding and decreased stability can be the major consequences of pathogenic missense mutations [33, 34]. However, protein stability is necessary but not sufficient for protein function, and proteins do not evolve to maximize their stability. Typically, the magnitude of the effects of mutations on stability can be quantified by changes in unfolding free energy (ΔΔGfold) (Fig. 4):
$$\Delta\Delta G_{\mathrm{fold}} = \Delta G_{\mathrm{fold}}^{\mathrm{mut}} - \Delta G_{\mathrm{fold}}^{\mathrm{WT}} \qquad (1)$$
Fig. 4 Annotation of the effects of mutations on proteins with available structures
Table 4 lists several state-of-the-art methods for predicting the quantitative changes in unfolding free energy upon mutation and provides short descriptions and links to the corresponding programs and servers. Methods described in this section differ in their energy functions, in the procedures used for optimization and sampling, and in the algorithms used for training, if applicable. Energy functions range from physics-based force fields, which describe fundamental physical forces between atoms, to knowledge-based potentials, which are based on statistical analyses of protein structures and residue properties.
The majority of these methods require the coordinates of protein structures, whereas methods such as MuStab or iPTREE-STAB do not use structural data, but their performance is limited. The performance of different methods was evaluated in several studies [35–37]. In the first study [35] the following performance ranking was reported: EGAD > CC/PBSA > I-Mutant2.0 > FoldX > Hunter > Rosetta, with correlation coefficients between experimental and predicted ΔΔG values ranging from 0.59 to 0.26 and standard deviations ranging from 0.95 to 2.32 kcal mol⁻¹. However, the servers of the top-performing methods EGAD and CC/PBSA are no longer available. In the second study I-Mutant3.0, Dmutant, and FoldX were found to be the most reliable predictors [36]. There are several servers for assessing the effects of mutations on stability that are straightforward and easy to use. Here we present a protocol on how to use the FoldX software.
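Independently of any particular predictor, Eq. 1 and the benchmark statistics quoted above (correlation coefficient and standard deviation between experimental and predicted ΔΔG) can be illustrated with a short sketch. This is not the FoldX protocol. The function names and all numbers are invented, and the standard deviation is taken here as the population standard deviation of the prediction errors, which is an assumption because the cited benchmarks may define it differently. Whether a negative or positive ΔΔG means destabilization depends on whether ΔG is defined for unfolding or folding, so the sketch does not interpret the sign.

```python
import math
import statistics
from typing import Sequence


def ddg_fold(dg_mutant: float, dg_wild_type: float) -> float:
    """Eq. 1: change in free energy upon mutation (mutant minus wild type)."""
    return dg_mutant - dg_wild_type


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def error_std(experimental: Sequence[float], predicted: Sequence[float]) -> float:
    """Population standard deviation of (predicted - experimental), same units."""
    errors = [p - e for e, p in zip(experimental, predicted)]
    return statistics.pstdev(errors)


# Invented toy values in kcal/mol, for illustration only.
toy_experimental = [-1.2, 0.3, -2.5, -0.8, 0.1]
toy_predicted = [-0.9, 0.5, -1.8, -1.1, -0.2]

print(ddg_fold(dg_mutant=6.5, dg_wild_type=8.0))
print(round(pearson_r(toy_experimental, toy_predicted), 2))
print(round(error_std(toy_experimental, toy_predicted), 2))
```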
3.5 Predicting the Effects of Mutations on Protein–Protein Binding Affinity
A crucial prerequisite for proper biological function is a protein’s ability to establish highly selective interactions with macromolecular partners. A missense mutation that affects protein interactions [38–40] may cause significant perturbation or complete abolishment of protein function, potentially leading to disease. Typically, the change in binding free energy (ΔΔGbind) is used to quantify the magnitude of mutational effects on protein–protein interactions (Fig. 4):
$$\Delta\Delta G_{\mathrm{bind}} = \Delta G_{\mathrm{bind}}^{\mathrm{mut}} - \Delta G_{\mathrm{bind}}^{\mathrm{WT}} \qquad (2)$$
The binding energy is calculated as the difference between the free energy of the complex AB and the free energies of the unbound proteins A and B:
$$\Delta G_{\mathrm{bind}} = G_{AB} - G_{A} - G_{B} \qquad (3)$$
There are very few methods that estimate actual ΔΔGbind values, and these methods require all-atom or at least backbone atom coordinates of the wild-type and/or mutated protein. Some of the methods use coarse-grained predictors based on statistical or empirical potentials; others apply molecular mechanics force fields with different solvation models. For example, the molecular mechanics Poisson–Boltzmann surface area (MM-PBSA) method and its derivatives have been shown to yield very good agreement between predicted and experimental values, with correlation coefficients up to 0.69 [41]. For all methods, the right choices of minimization protocols, energy functions, and solvation models are crucial for achieving reasonable prediction accuracy. In addition, prediction accuracy strongly depends on the type of mutation and its location in a protein complex. For example, if a residue is located on the protein–protein interface, its mutation might have a larger effect on protein–protein interaction and binding affinity than a non-interfacial mutation [41]. The location of mutated sites can be mapped using the SPPIDER (http://sppider.cchmc.org/) [42] or Meta-PPISP (http://pipe.scs.fsu.edu/meta-ppisp.html) [43] servers.
These servers are recommended by two assessments of computational methods for predicting protein–protein interaction sites [44, 45]. Users can also analyze structures and locations of mutations using the Chimera or VMD software. Below is a step-by-step protocol, reported in our previous paper [41], to predict the impact of mutations on binding affinity. This protocol combines molecular mechanics force fields with statistical (BeAtMuSiC) and empirical (FoldX) energy functions. All files are provided via the ftp site ftp://ftp.ncbi.nih.gov/pub/panch/Mutation_binding. An improved version is available from our MutaBind server https://www.ncbi.nlm.nih.gov/projects/mutabind/.
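The bookkeeping behind Eqs. 2 and 3 can be illustrated with a short sketch. It only combines free energies that some other method has already estimated; it is not the published protocol or the MutaBind algorithm. The class and function names and all energy values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateEnergies:
    """Free energies (same units, e.g. kcal/mol) for one sequence variant."""
    complex_ab: float   # G_AB, the bound complex
    partner_a: float    # G_A, unbound protein A
    partner_b: float    # G_B, unbound protein B


def dg_bind(state: StateEnergies) -> float:
    """Eq. 3: binding free energy = G_AB - G_A - G_B."""
    return state.complex_ab - state.partner_a - state.partner_b


def ddg_bind(mutant: StateEnergies, wild_type: StateEnergies) -> float:
    """Eq. 2: change in binding free energy, mutant minus wild type."""
    return dg_bind(mutant) - dg_bind(wild_type)


# Invented energies for illustration only.
wild_type = StateEnergies(complex_ab=-250.0, partner_a=-120.0, partner_b=-118.0)
mutant = StateEnergies(complex_ab=-247.5, partner_a=-119.0, partner_b=-118.0)

print(dg_bind(wild_type))          # -12.0
print(dg_bind(mutant))             # -10.5
print(ddg_bind(mutant, wild_type)) # 1.5
```

With the definition in Eq. 3, a more negative ΔGbind corresponds to tighter binding, so the positive ΔΔGbind in this toy example indicates that the mutation weakens the interaction.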
Recently a new computational method, MutaBind [46], was developed to evaluate the effects of mutations on protein–protein interactions (http://www.ncbi.nlm.nih.gov/projects/mutabind/). The MutaBind method uses molecular mechanics force fields, statistical potentials, and fast side-chain optimization algorithms. It maps mutations onto a protein complex structure, calculates the associated changes in binding affinity, determines the deleterious effect of a mutation, estimates the confidence of this prediction, and produces a mutant structural model for download.
3.6 Assessing the Changes in Protein Conformations and Hydrogen Bond Networks Induced by Mutations
Proteins may adopt different conformations along the pathway of a biochemical reaction, and their intrinsic flexibility and ability to sample alternative conformations are crucial for protein function. Mutations might shift the equilibrium between different conformations (Fig. 4); as a result, the most populated conformation of a mutated protein can differ in structure, stability, and functional activity from the wild-type conformation. It is extremely difficult to model structural changes in a protein backbone produced by mutations, and large conformational shifts can be predicted correctly only in a few cases. In fact, most algorithms discussed in the previous sections do not account for backbone flexibility. If several conformations of the same protein are available in the structural databank, all of them should ideally be used to provide a complete picture of the dynamical and energetic effects of mutations [20]. Mutations can either change the global conformation of an entire molecule or have a localized effect in a small region.
In a recent study of the NFAT5 transcription factor [47], different mutations from the same DNA-binding loop were analyzed, and it was shown that the effects of these mutations on protein dynamics and DNA binding were drastically different although they were located very close to each other in sequence and structure. Protein dynamics can be studied by performing molecular dynamics (MD) simulations using the NAMD [2], CHARMM [3], and Amber [48] MD packages. NAMD, for example, is fast and easy to use; it can be applied with CHARMM or Amber force fields, whereas the VMD or CHARMM packages can be used to analyze the MD trajectories produced by NAMD. Changes in structure may also be assessed by analyzing hydrogen bond networks and their differences between mutant and wild-type proteins, since hydrogen bonds are important in determining protein stability. A mutation that disrupts hydrogen bonds might have a significant impact on protein conformation, stability, and dynamics (reviewed in [49], Fig. 4). Hydrogen bonds can be calculated using the HBOND (http://caps.ncbs.res.in/iws/hbond.html) [50] or PIC (http://pic.mbu.iisc.ernet.in/) [51] servers and visualized with Chimera. Below is a step-by-step protocol to assess the conformational changes induced by mutations: