Which type of mutation is most likely to render a protein non functional?

In this review we describe a protocol to annotate the effects of missense mutations on proteins, their functions, stability, and binding. For this purpose we present a collection of the most comprehensive databases that store different types of sequencing data on missense mutations, and we discuss their relationships, possible intersections, and unique features. Next, we suggest an annotation workflow using state-of-the-art methods and highlight their usability, advantages, and limitations for different cases. Finally, we address the particularly difficult problem of deciphering the molecular mechanisms by which mutations affect proteins and protein complexes, in order to understand the origins and mechanisms of diseases.

Keywords: Protein–protein interactions, Databases, Mutations

1 Introduction

The era of genome sequencing has uncovered a large number of human genetic variations, as illustrated by the milestone 1000 Genomes project [1]. Mutations and genetic recombinations may occur naturally during cell division and may also be caused by extrinsic factors. A single nucleotide substitution that results in a codon change encoding a different amino acid is called a “missense point mutation” (“mutation” hereafter). Germline missense mutations are passed to progeny, whereas somatic mutations are not inherited. Because many genetic variations have predominantly neutral effects, they have accumulated in the human population; they can be responsible for many individual phenotypic traits and may be used for genetic fingerprinting. Generally, a variant that occurs frequently in a population is termed a polymorphism, and single nucleotide polymorphisms (SNPs) are one of the most common types of genetic variation.

Missense mutations can render proteins nonfunctional and may be responsible for many diseases. From the clinical perspective, these non-neutral mutations affecting human health are of primary interest. For some diseases and genes, particularly those following Mendelian inheritance patterns, the causal genotype–phenotype relationship has already been established, while for complex polygenic diseases involving multiple factors it is still unknown. Moreover, genetic variants with low penetrance, which are only weakly associated with disease phenotypes, can be annotated only in large samples, and for many diseases the genetic determinants have yet to be discovered.

2 Materials

2.1 Software

  1. Molecular dynamics packages: NAMD (http://www.ks.uiuc.edu/Research/namd/) [2] and CHARMM (http://www.charmm.org/) [3].

  2. Structural visualization packages: VMD (http://www.ks.uiuc.edu/Research/vmd/) [4], Chimera (http://www.cgl.ucsf.edu/chimera/) [5] and CN3D (http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml) [6].

2.2 Online Resources

  1. FTP site at NCBI (ftp://ftp.ncbi.nih.gov/pub/panch/Mutation_binding/) includes examples of configuration and runfiles for VMD, NAMD, and CHARMM used in protocols of Subheadings 3.5 and 3.6.

  2. Databases collecting human genetic variations, mutations, and data on their clinical relevance (Table 1).

    Table 1

    A summary of databases integrating the data on human genetic variations, mutations, and their clinical relevance

    Database | Description | URL | Reference
    --- | --- | --- | ---
    COSMIC | Somatic mutations in cancer | http://cancer.sanger.ac.uk/cosmic | [16]
    HGMD | Published gene lesions responsible for human inherited disease | http://www.hgmd.cf.ac.uk/ac/ | [52]
    TCGA | Cancer Genome Atlas | http://cancergenome.nih.gov/ | [17]
    OMIM | Human genes, inherited genetic disorders, and germline mutations | http://www.omim.org/ | [14]
    dbGaP | Clinical studies linking genotypes to disease phenotypes | http://www.ncbi.nlm.nih.gov/gap | [7]
    GTR | Genetic Testing Registry | http://www.ncbi.nlm.nih.gov/gtr/ | [8]
    PheGenI | Phenotype–genotype integrator | http://www.ncbi.nlm.nih.gov/gap/phegeni | [12]
    ClinVar | Genomic variations and their relationship to human health | http://www.ncbi.nlm.nih.gov/clinvar/ | [15]
    dbSNP | Single nucleotide polymorphisms | http://www.ncbi.nlm.nih.gov/SNP/ | [9]
    dbVar | Large genetic variations | http://www.ncbi.nlm.nih.gov/dbvar | [10]
    SwissVar | Disease-related variants in Uniprot, provides structural mapping | http://swissvar.expasy.org | [19]
    PharmGKB | Associates genes with drugs. Catalogs genetic variations known to impact drug response | https://www.pharmgkb.org/ | [53]
    MutDB | Integrates human variations and COSMIC mutations, maps to structural data, KEGG pathways, and includes predictions of effects of variations on phenotype | http://www.mutdb.org/ | [18]
    CBioPortal | Visualization and analysis of large cancer studies. It is based on TCGA and incorporates the overlapping data from COSMIC | http://cbioportal.org/ | [54]


  3. Web servers for characterization of structural features of mutations (Table 2).

    Table 2

    Online resources for exploring the structural, cellular, and genomic context of mutated proteins

    Database | Description | URL | Reference
    --- | --- | --- | ---
    HPRD | Integrates information pertaining to domain architecture, posttranslational modifications, interaction networks, and disease association | http://www.hprd.org | [55]
    EBI IntAct | Molecular interaction data | http://www.ebi.ac.uk/intact/ | [56]
    BioSystems | Provides integrated access to genes, proteins, small molecules, and pathways | http://www.ncbi.nlm.nih.gov/biosystems/ | [21]
    Reactome | Curated and peer-reviewed pathways | http://www.reactome.org | [57]
    KEGG | Manually curated pathways | http://www.genome.jp/kegg/pathway.html | [58]
    CDD | Annotates functional and binding sites in protein domain families | http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml | [23]
    IBIS | Annotates protein–protein and protein–DNA/RNA/ion/small molecule interactions and binding sites. Identifies conserved binding sites in homologous protein complexes | http://www.ncbi.nlm.nih.gov/Structure/ibis/ibis.cgi | [27]
    muPIT | Interactive exploration of mutations and their structural context | http://mupit.icm.jhu.edu | [59]
    DMDM | Domain mapping of mutations | http://bioinf.umbc.edu/dmdm/ | [60]
    PolyDoms | Mapping of mutations to protein domains and prediction of structural and functional impact of mutations | https://polydoms.cchmc.org/polydoms/ | [61]


  4. Web servers for predicting the phenotypic effects of mutations (Table 3)

    Table 3

    A summary of online resources for predicting the phenotypic effects of mutations

    Name | Features and methods | URL | Reference
    --- | --- | --- | ---
    SIFT | Sequence homology and physical properties of amino acids | http://sift.bii.a-star.edu.sg | [62]
    PolyPhen-2 | Bayesian models based on substitution scores, phylogenetic conservation, and structural features | http://genetics.bwh.harvard.edu/pph2/ | [31]
    SNPs3D | Sequence conservation and changes in physical properties of amino acids affecting protein stability | http://snps3d.org | [63]
    PMut | Neural network classifier | http://mmb2.pcb.ub.es/PMut/ | [64]
    SNAP | Neural network classifier based on protein structural properties predicted from protein sequence | https://www.rostlab.org/services/snap/ | [65]
    Fathmm | Cancer-specific and other disease-specific Hidden Markov Models for predicting the functional consequences of coding and noncoding variants | http://fathmm.biocompute.org.uk | [66]
    MutationAssessor | Predicts the effects of mutations on subfamily specific sites | http://mutationassessor.org/ | [67]
    CHASM | Random Forest classifier for cancer somatic mutations | http://karchinlab.org/apps/appChasm.html/ | [68]
    MutPred | Uses sequence conservation and structural features, posttranslational modifications | http://mutpred.mutdb.org/ | [18]
    SNPs&GO | SVM classifier of disease-related variations based on protein functional annotation (GO) | http://snps.biofold.org/snps-and-go/ | [69]
    PROVEAN | Predicts the effects of amino acid substitutions, insertions, and deletions on protein function, allows scanning multiple mutations | http://provean.jcvi.org/ | [30]
    FunSAV | Random Forest-based classifier, uses structural features and network properties of mutated proteins | http://sunflower.kuicr.kyoto-u.ac.jp/sjn/FunSA/ | [70]
    nsSNPAnalyzer | Random Forest-based classifier, uses structural and evolutionary information | http://snpanalyzer.utmem.edu/ | [71]
    PANTHER | Uses phylogenetic and evolutionary information | http://www.pantherdb.org/tools/csnpScoreForm.jsp/ | [72]
    PhD-SNP | SVM classifier, uses sequence profiles | http://gpcr.biocomp.unibo.it/cgi/predictors/PhD-SNP/PhD-SNP.cgi/ | [73]
    SAAP | Precalculated database of predicted effects of known variants, considers structural properties and clashes resulting from amino acid substitutions | http://www.bioinf.org.uk/saap/ | [74]
    SusPect | Incorporates sequence conservation and network-based features | www.sbg.bio.ic.ac.uk/suspect/ | [75]
    KinDriver | Annotates driver mutations in protein kinases with experimental evidence demonstrating their functional role | http://kin-driver.leloir.org.ar/ | [76]
    ProKinO | Unified resource for mining the cancer kinome | http://vulcan.cs.uga.edu/prokino | [77]


  5. Web servers and programs for predicting the effects of mutations on protein stability (Table 4).

    Table 4

    A summary of online and software resources for predicting the effects of mutations on protein stability. The second column indicates the kind of data a method requires as input: protein sequence, structure, or either of the two. Here “ΔΔG” refers to ΔΔGfold.

    Resource | Input | Type of output, method, and energy function | URL | Reference
    --- | --- | --- | --- | ---
    FoldX | Structure | ΔΔG using empirical force fields | http://foldxsuite.crg.eu/ | [78]
    PoPMuSiC-2.0 | Structure | ΔΔG using a combination of statistical potentials and neural networks | http://dezyme.com/ | [79]
    ERIS | Structure | ΔΔG using physical force fields with atomic modeling | http://dokhlab.unc.edu/tools/eris/ | [80]
    CUPSAT | Structure | ΔΔG using mean force atom pair and torsion angle potentials | http://cupsat.tu-bs.de/ | [81]
    Hunter | Structure | ΔΔG using knowledge-based potentials | http://bioinfo41.weizmann.ac.il/hunter/ | [82]
    MultiMutate | Structure | ΔΔG using four-body scoring functions | http://www.math.wsu.edu/math/faculty/bkrishna/DT/Mutate/ | [83]
    AUTO-MUTE | Structure | ΔΔG using knowledge-based potentials | http://proteins.gmu.edu/automute | [84]
    NeEMO | Structure | ΔΔG using residue interaction networks | http://protein.bio.unipd.it/neemo/ | [85]
    DUET | Structure | ΔΔG using an integrated computational approach of mCSM and SDM | http://bleoberis.bioc.cam.ac.uk/duet/stability | [86]
    MAESTRO | Structure | ΔΔG using multi agent stability prediction | http://biwww.che.sbg.ac.at/MAESTRO | [87]
    I-Mutant3.0 | Structure/Sequence | ΔΔG using SVMs | http://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant3.0/I-Mutant3.0.cgi | [88]
    MUPro | Structure/Sequence | Predicts qualitative decrease/increase of stability using SVM | http://mupro.proteomics.ics.uci.edu/ | [89]
    iStable | Structure/Sequence | ΔΔG using SVM | http://predictor.nchu.edu.tw/iStable | [90]
    MuStab | Sequence | Predicts qualitative decrease/increase of stability using SVM | http://bioinfo.ggc.org/mustab/ | [91]
    iPTREE-STAB | Sequence | ΔΔG using decision tree methods | http://bioinformatics.myweb.hinet.net/iptree.htm | [92]


  6. Web servers and programs for predicting the effects of mutations on protein–protein binding affinity (Table 5).

    Table 5

    A summary of online and software resources for predicting the effects of mutations on protein–protein binding affinity. Here “ΔΔG” refers to ΔΔGbind. All resources require a structure as input.

    Resource | Type of output, method, and energy function | URL | Reference
    --- | --- | --- | ---
    MutaBind | ΔΔG using molecular mechanics force fields, statistical potentials, and fast side-chain optimization algorithms. Produces a model of mutant. | https://www.ncbi.nlm.nih.gov/projects/mutabind/ | [46]
    BeAtMuSiC | ΔΔG using a set of statistical potentials, does not produce a model of mutant | http://babylone.ulb.ac.be/beatmusic | [93]
    ELASPIC | ΔΔG for mutations located on interface using a combination of semi-empirical energy terms and molecular features; does not produce a model of mutant | http://www.kimlab.org/software/elaspic | [94]
    DrugScorePPI | ΔΔG for alanine-scanning mutations located on interface using knowledge-based scoring functions; does not produce a model of mutant | http://cpclab.uni-duesseldorf.de/dsppi/ | [95]
    SNP-IN | Classifies effects of mutations on protein–protein interactions using supervised and semi-supervised learning; does not produce a model of mutant | http://andromeda.rnet.missouri.edu/snpintool/ | [96]
    FoldX | ΔΔG using empirical force field. Produces a model of mutant. | http://foldxsuite.crg.eu/ | [78]


3 Methods

Recent advances in experimental methods have reduced the cost of DNA sequencing and equipped many labs and hospitals with genotyping and sequencing assays, so that these data, along with the clinical profiles of patients, can be deposited into central archival facilities. These include the database of Genotypes and Phenotypes (dbGaP) [7] and the Genetic Testing Registry (GTR) [8], both developed at NCBI, NIH. The genotypes are mapped to a reference genome assembly and typically fall into two major categories of variation: (1) single nucleotide polymorphisms (SNPs) deposited in the dbSNP database [9] and (2) larger structural genetic variations, including long indels, inversions, and copy number variants (CNVs), deposited in the dbVar database [10]. The phenotypes in these databases are mainly organized by disease names or clinical conditions, and diseases are classified according to the Disease Ontology (DO) [11] and Medical Subject Headings (MeSH) terms. Additionally, the NCBI phenotype–genotype integrator (PheGenI) [12] merges genetic variations identified by genome-wide association studies (GWAS) with the dbVar, OMIM, GTR, and dbSNP databases.

The clinical implications of genetic variations are recorded in several other databases. The Human Gene Mutation Database (HGMD) [13] and Online Mendelian Inheritance in Man (OMIM) [14] are the main databases that integrate information on Mendelian genetic disorders with the genes and mutations reported to be causative. ClinVar [15] is another archive; it collects data on genetic variations from dbSNP and dbVar and integrates them with the available clinical evidence on these variants obtained from multiple studies. In ClinVar each variant is assigned a score that reflects how consistently its clinical effect is reported across studies. ClinVar annotates variants as benign, likely benign, likely pathogenic, pathogenic, of uncertain significance, or as having conflicting interpretations. Importantly, it also annotates variants associated with individual drug effects.

Germline mutations constitute the bulk of mutations in OMIM, ClinVar, and HGMD, while somatic missense mutations are found predominantly in cancer cells and are not inherited. Information on the location, tissue type, and frequency of somatic mutations in tumor samples, together with the cancer type, can be obtained from the COSMIC [16] and TCGA [17] databases, while other resources, for instance MutDB [18] and SwissVar [19], integrate the data on germline and somatic mutations for specific genes and diseases.

Figure 1 describes a computational pipeline for exploring mutations and assessing their effects on protein structure, function, and interactions. Different sections of this chapter follow this pipeline and suggest the protocols to solve each specific problem.


Fig. 1

Mutation assessment workflow

3.1 Collecting and Integrating the Data on Human Polymorphisms and Mutations

First, we will describe a protocol for extracting clinically relevant mutations from ClinVar, COSMIC, and TCGA databases to further analyze them with respect to their clinical and functional impacts. Here we use human monomeric Casitas B-lineage lymphoma c-CBL protein (CBL) as an example [20]. The links and references to web resources used in this protocol are listed in Table 1.

  1. NCBI variation viewer (http://www.ncbi.nlm.nih.gov/variation/view/) [21] can be searched by gene name (CBL), Refseq accession (NP_005179), or Uniprot ID (P22681.2). A snapshot of variation viewer shows the genetic variation from ClinVar in the locus of CBL gene (Fig. 2a). Different variants in this viewer are depicted as separate tracks below the CBL gene locus. The ClinVar track displays multiple variants as boxes with the number of variants listed within each box. Variants with benign effect are shown using green color, whereas the purple boxes show pathogenic variants.


    Fig. 2

    Identification of clinically relevant mutations in ClinVar, COSMIC, and TCGA. (a) NCBI Variation Viewer showing the CBL gene locus on chromosome 11; ClinVar and dbSNP data shown as tracks below the gene, pathogenic mutations are presented as purple squares and closely-located mutations are grouped together in ClinVar track; (b) one of the pathogenic mutations p.Y371H shown in dbSNP with the corresponding disease annotation in GTR; (c) a representative structure of CBL protein visualized in SNP3D with Cn3D software, mutated p.Y371H is shown with the yellow side chain; (d) cBioPortal view of CBL protein with missense mutations mapped onto the corresponding domains; (e) a representative structure visualized with JMol directly in cBioPortal with all mapped missense mutations shown in green

  2. Additionally, the user can download all mutations from the COSMIC ftp server in VCF format (CosmicCodingMuts.vcf.gz) and display these mutations in a separate track in the variation viewer (choose menu “your data” and select “add track” option). Note that it is necessary to select the same genome assembly in both variation viewer and in COSMIC (e.g., GRCh37).

  3. Each variant with a valid ClinVar annotation is linked to a corresponding dbVar or dbSNP record. Here we will focus on a single nucleotide variant of the CBL gene with the dbSNP accession rs267606706. As shown in Fig. 2b, it is a missense mutation in which the nucleotide substitution T->C results in the amino acid substitution p.Y371H. There is clinical evidence that this variant can cause a Noonan syndrome-like disorder and/or juvenile myelomonocytic leukemia. The GTR studies cited in ClinVar [22] show that p.Y371H is a heterozygous germline substitution.

  4. The dbSNP page for accession rs267606706 contains a link to “3D structure mapping” (found under the “NCBI resources” section) which points to the SNP3D page where several synonymous and missense variants are mapped onto the CBL protein structure. By default, this variant is selected, but it is possible to select other variants for display by clicking on the “Cn3D selected” button. Additionally, the SNP3D page contains a link “CD” that shows conserved domains from the CDD database mapped onto this protein [23]. Figure 2c depicts the structure of CBL using Cn3D with mutated Tyr residue side chain colored in yellow.

  5. As was shown previously using ClinVar, the germline p.Y371H mutation may be linked to leukemia. However, many cancer mutations are somatic and therefore are not present in ClinVar. To explore somatic mutations we switch to cBioPortal, which allows exploring mutations from the TCGA and COSMIC databases. Open the cBioPortal web page and submit the query “CBL” as the user-defined gene set. A summary for different types of cancers will be shown for the CBL gene. Open the second tab, “Mutations”, which displays mutations on the CBL protein sequence. Figure 2d depicts missense CBL mutations mapped by cBioPortal onto the corresponding CDD domain context. Additionally, cBioPortal provides the locations of mutations on protein structures. The blue footprints in Fig. 2d show that several structures cover the CBL protein sequence and can be explored interactively. All missense mutations from the COSMIC and TCGA databases are mapped onto the representative structure (in green) and are displayed in the web browser (Fig. 2e).

  6. Each mutation in cBioPortal “Mutations” tab is shown as a pin indicating its position in CBL protein sequence. The height of the pin corresponds to the number of known mutations. Place the mouse cursor over the Znf domain (zinc finger, shown in yellow) and over the first pin from the left in Znf domain. As a result, a window with a list of mutations and cancer types will pop up. For zinc finger domain the first two missense mutations, C384R and C384W, are associated with glioblastoma and melanoma, respectively. By searching for Cys384 residue in the table below and pressing the “3D” button, the structural location of this mutation is displayed. We will explain how this mutation could be interpreted in the context of molecular interactions in the next section.
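The filtering implied by steps 2 and 3 above can also be done offline once the COSMIC VCF file has been downloaded. The following is a minimal sketch, not part of the original protocol. It assumes that the INFO column of the COSMIC coding-mutation VCF carries the gene symbol under a `GENE` key and the protein change under an `AA` key (for example `AA=p.Y371H`); these key names should be checked against the header of the downloaded file, and all function and class names here are illustrative.

```python
import gzip
import re
from typing import Iterable, Iterator, NamedTuple

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Protein-level missense notation such as "p.Y371H".
MISSENSE_RE = re.compile(rf"^p\.([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


class Missense(NamedTuple):
    gene: str
    wild_type: str
    position: int
    mutant: str


def parse_info(info: str) -> dict:
    """Split a VCF INFO column ("KEY=value;KEY=value;FLAG") into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def missense_records(lines: Iterable[str], gene: str) -> Iterator[Missense]:
    """Yield missense substitutions annotated for `gene` in VCF text lines."""
    for line in lines:
        if line.startswith("#"):          # skip meta-information and header lines
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:              # INFO is the eighth VCF column
            continue
        info = parse_info(columns[7])
        # ASSUMPTION: gene symbol and protein change are stored as GENE and AA.
        if info.get("GENE") != gene:
            continue
        match = MISSENSE_RE.match(info.get("AA", ""))
        # Keep only substitutions that change the amino acid.
        if match and match.group(1) != match.group(3):
            yield Missense(gene, match.group(1), int(match.group(2)), match.group(3))


def read_cosmic(path: str, gene: str) -> list:
    """Read a gzip-compressed VCF file and collect missense records for `gene`."""
    with gzip.open(path, "rt") as handle:
        return list(missense_records(handle, gene))
```

A call such as `read_cosmic("CosmicCodingMuts.vcf.gz", "CBL")` would then return the missense substitutions recorded for CBL. Nonsense, frameshift, and synonymous entries are skipped because they do not match the missense pattern.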

3.2 Finding Protein Domains and Functional Sites Affected by Mutations

Now we consider the domain context of CBL mutations described in the previous section (p.Y371H and p.C384R) and annotate their impacts on protein interactions and signaling pathways. The web resources used in this protocol are listed in Table 2.

  1. Evolutionarily conserved sites in a multiple sequence alignment usually correspond to functionally important sites, and mutations at these sites can be harmful to protein function. If a protein of interest has a known PDB structure, conservation profiles can be downloaded from the PDBsum resource; otherwise the ConSurf server [24] can be used. In addition, the CDD server offers functional annotations of sites in conserved protein domains, whereas the IBIS server provides locations of binding sites for different types of binding partners (protein, small molecule, nucleic acid, ion, and peptide) and facilitates the mapping of a comprehensive biomolecular interaction network for a given protein query (with or without structure) [25–27]. Similar binding sites in IBIS are clustered together based on their sequence and structure conservation.

  2. Open the IBIS web page and search for the 1FBV structure, chain A. Go to the “protein–protein” tab and click on the balloon with the annotation “RING” domain to display interactions of the CBL RING domain with other domains/proteins (Fig. 3a). Binding sites are shown on the CBL sequence as triangles, and highly conserved binding sites are shown in red. In the list of interaction partners below, the first conserved binding site cluster is formed between the RING domain of CBL and a ubiquitin conjugating enzyme from the UBCc family. By clicking on the plus sign next to “UBCc”, one can see the corresponding binding site and the alignment of similar binding sites found in different CBL-UBCc complexes. By opening the link to the Cn3D viewer, one can explore the interfaces and binding sites in these protein complexes (Fig. 3c).


    Fig. 3

    Analysis of conserved functional and binding sites in mutated proteins using IBIS method. (a) conserved protein binding site in the RING domain of CBL; (b) interaction graph of CBL protein (represented by 1FBV PDB structure); (c) visualization of protein interface between CBL and UBCc in Cn3D software. Position 384 in 1FBV corresponds to position 338 in the full-length PDB sequence

  3. The interaction graph in Fig. 3b shows the observed (black lines) and predicted interaction partners of CBL. Next, we will focus on interactions of CBL with zinc ions and the UBCc ubiquitin conjugating enzyme. Note that self-links indicate interactions between domains in CBL protein within or between CBL chains.

  4. The structure in Fig. 3c shows that conserved cysteine residues in the binding site coordinate two zinc ions, apparently playing an important structural role. A substitution of cysteine by arginine disrupts the coordination of Zn, which affects the structure and stability of the zinc finger RING domain and may also affect CBL function.

  5. Structural and biochemical analyses [28] show that in its inactive state CBL adopts an autoinhibited conformation. Substrate binding and Tyr371 phosphorylation activate CBL by producing a large conformational change that places the RING domain and UBCc in close proximity to the substrate, as required for effective catalysis. Importantly, the p.Y371H mutation may prevent activation of CBL by phosphorylation and may also affect the interaction with UBCc.

  6. The impact of mutations on signaling pathways can be explored using the recently developed PathiVar server [29]. Alternatively, it is possible to explore all pathways in which CBL interacts with UBCc using the KEGG, Reactome, or NCBI Biosystems databases.

  7. Search for ‘CBL AND Y731 and “Homo sapiens”[Organism]’ in the NCBI Biosystems portal. The first record will point to the “Regulation of signaling by CBL” pathway and the role of Y731 phosphorylation will be explained in the pathway description.

In addition to the NCBI IBIS server that allows analyzing the domain context and structural determinants of interactions, several other servers (DMDM, PolyDoms, and muPIT) provide the mapping of mutations on protein domains and protein structures.
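The link between alignment conservation and functional importance described in step 1 of this section can be illustrated with a simple per-column score. The sketch below computes a Shannon-entropy-based conservation value for each column of a multiple sequence alignment. It is a generic textbook measure and is not the scoring scheme used by ConSurf, PDBsum, CDD, or IBIS; the normalization by log2(20) and the function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the residues in one column; gaps are ignored."""
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return float("nan")               # column consists of gaps only
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(alignment: Sequence[str]) -> List[float]:
    """Return one score per column: 1.0 = fully conserved, lower = more variable.

    ASSUMPTION: the score is 1 - H / log2(20), i.e. the entropy H is normalized
    by the maximum entropy of the 20-letter amino acid alphabet.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)
    profile = []
    for i in range(length):
        column = "".join(seq[i] for seq in alignment)
        profile.append(1.0 - column_entropy(column) / max_entropy)
    return profile
```

Columns that score close to 1.0 are candidates for functionally important positions, for example the zinc-coordinating cysteines discussed above. Any cutoff used to call a site "conserved" would be a further assumption and should be calibrated on the protein family of interest.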

3.3 Assessing If Mutations Have Damaging or Benign Effects on Proteins

Many methods have been proposed to predict the effects of missense mutations on proteins, classifying them as damaging or benign. These methods differ in terms of the properties of mutations or proteins used during the training procedure as well as in terms of the algorithms applied for decision-making. For example, machine learning algorithms train their models to distinguish known disease-associated from neutral mutations. Other methods do not explicitly train their models but almost all methods described in this section exploit the evolutionary conservation assuming that changes at conserved positions tend to be more deleterious. Besides sequence conservation various other sequence and structural features are used, which may include: changes in physicochemical properties between wild-type and substituted amino acid, structural features (mostly solvent accessibility), site mutability in DNA, and sequence context of the site.

An unbiased testing and comparison of machine learning methods is obviously an issue since they are usually trained on all available datasets of mutations and it is difficult to obtain a test set which would not overlap with training set. There are several experimental studies on variants in P53, LacI, and ABCA1 proteins which can be regarded as unbiased test cases. Comparisons of different methods on these experimental sets reported up to 70 % TPR (True positive rate) at 10 % FPR (False positive rate) [30]. Models trained to distinguish Mendelian variants with pronounced deleterious effects are more appropriate and accurate for predicting the effects of Mendelian mutations. The accuracy of these models is much higher than of those models that aim to assess the effects of mutations from complex diseases including cancer. This is evident from the evaluation of PolyPhen-2 performance which yields 0.70–0.77 TPR at 10 % FPR when trained on the HumDiv dataset (Mendelian disease mutations) and drops to 0.50–0.52 TPR for HumVar (all human disease causing mutations) trained models [31]. It should be mentioned that there are several methods that are trained to distinguish cancer mutations from neutral polymorphisms (Table 3); however, no existing method can accurately identify driver and passenger mutations within the pool of cancer mutations.
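The performance figures quoted above are true positive rates at a fixed false positive rate. As a minimal sketch of how such a figure is obtained from predictor scores, the function below scans all score thresholds and returns the highest TPR whose FPR does not exceed the chosen limit. It assumes that higher scores mean "more damaging" and that labels are booleans; it does not reproduce the evaluation code of any study cited here.

```python
from typing import Sequence


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[bool], max_fpr: float = 0.10) -> float:
    """Highest true positive rate reachable while keeping FPR <= max_fpr.

    scores: predictor outputs, higher = more likely damaging (assumption).
    labels: True for known damaging variants, False for neutral ones.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both damaging and neutral variants are required")

    best_tpr = 0.0
    # Every distinct score is a candidate threshold: call "damaging" if score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        if false_pos / negatives <= max_fpr:
            best_tpr = max(best_tpr, true_pos / positives)
    return best_tpr
```

Because the result depends on which variants make up the test set, the same predictor can yield different values on Mendelian and complex-disease sets, which is the effect described for the HumDiv and HumVar models.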

One of the most comprehensive comparisons of different methods to predict phenotypic effects of mutations was performed recently [32]. To avoid any bias in the evaluation of these methods, most of which were trained on all available sets of disease mutations and neutral polymorphisms, the authors of this study tested different methods on an independent set of experimental studies. They concluded that the methods varied in accuracy and applicability, with SNPs&GO and MutPred being the most reliable predictors. Interestingly, despite the fact that different methods use similar sets of features, only half of their correctly predicted cases overlap [32]. Since this study was published, several new methods have been introduced (see Table 3). For example, in contrast to many methods that assess the amino acid frequency distribution at a given site of interest, the recently developed method PROVEAN accounts for the sequence context around the site of interest, and poorly aligned regions/sites are assigned very low scores. Overall, the effect of alignment quality on the performance of all methods is largely undetermined but suspected to be very large. Therefore, user-guided construction of accurate alignments of homologous proteins would be very advantageous for accurate annotation of the effects of mutations.

3.4 Predicting the Impact of Mutations on Protein Stability

While methods that classify mutations as damaging or benign are widely used by the genomics community, a new level of annotation is needed to explain why and how these mutations damage proteins. Algorithms and servers described in the next several sections address these tasks. Proteins may evolve through the acquisition of new mutations, most of which are destabilizing but phenotypically neutral. The stability of a protein may be directly related to its functional activity, and incorrect folding and decreased stability can be the major consequences of pathogenic missense mutations [33, 34]. However, protein stability is necessary but not sufficient for protein function, and proteins do not evolve to maximize their stability. Typically, the magnitude of the effects of mutations on stability can be quantified by changes in unfolding free energy (ΔΔGfold) (Fig. 4).


Fig. 4

Annotation of the effects of mutations on proteins with available structures

$$\Delta\Delta G_{\mathrm{fold}} = \Delta G_{\mathrm{fold}}^{\mathrm{mut}} - \Delta G_{\mathrm{fold}}^{\mathrm{WT}} \qquad (1)$$

Table 4 lists several state-of-the-art methods for predicting the quantitative changes in unfolding free energy upon mutation and provides short descriptions and links to the corresponding programs/servers. Methods described in this section differ in terms of energy functions, procedures used for optimization and sampling, and algorithms used for training, if applicable. Energy functions vary from physics-based force fields, which describe fundamental physical forces between atoms, to knowledge-based potentials, which are based on statistical analyses of protein structures and residue properties. The majority of these methods require the coordinates of protein structures; methods like MuStab or iPTREE-STAB do not use structural data, but their performance is correspondingly limited. The performance of different methods was evaluated in several studies [35–37]. In the first study [35] the following performance ranking was reported: EGAD > CC/PBSA > I-Mutant2.0 > FoldX > Hunter > Rosetta, with correlation coefficients between experimental and predicted ΔΔG values ranging from 0.59 to 0.26 and standard deviations ranging from 0.95 to 2.32 kcal mol−1. However, the servers of the top performing methods EGAD and CC/PBSA are no longer available. In the second study I-Mutant3.0, Dmutant, and FoldX were found to be the most reliable predictors [36].
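The benchmark figures above compare predicted and experimental ΔΔG values by correlation and by the spread of the errors. The sketch below shows one way to compute such summary statistics for a user's own set of mutations. It uses the Pearson correlation coefficient and the root-mean-square error; the root-mean-square error is used here as a stand-in, since the exact error definition of the cited studies is not reproduced, and the function names are illustrative.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long value series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length with at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted: Sequence[float], experimental: Sequence[float]) -> float:
    """Root-mean-square error, in the units of the inputs (e.g. kcal/mol)."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty series of equal length")
    squared = sum((p - e) ** 2 for p, e in zip(predicted, experimental))
    return math.sqrt(squared / len(predicted))
```

Both statistics should be computed with the same sign convention for predicted and experimental values, because some programs report folding rather than unfolding free energy changes.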

There are several servers to assess the effects of mutations on stability that are straightforward and easy to use. Here we present a protocol on how to use FoldX software.

  1. Run RepairPDB module of FoldX to correct errors in the structure produced during the refinement (nonstandard angles or distances) (Runfile can be obtained from http://foldxsuite.crg.eu/command/RepairPDB).

  2. Run the BuildModel module of FoldX (http://foldxsuite.crg.eu/command/BuildModel) to introduce a mutation into the optimized wild-type structure. The BuildModel module optimizes the configurations of the side chains of amino acids in the vicinity of the mutated site and calculates the difference in unfolding free energy (ΔΔGfold) between the mutant and the repaired wild-type structure. The total change in unfolding free energy and each energy term can be obtained from the “Dif_BuildModel_*.fxout” output file.
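The two FoldX steps above can be scripted. The sketch below only assembles the command lines and the mutation list file. It rests on several assumptions that should be verified against the installed FoldX version: that the executable is called `foldx`, that it accepts `--command`, `--pdb`, and `--mutant-file` options, that RepairPDB writes a file named `<name>_Repair.pdb`, and that `individual_list.txt` uses entries of the form wild-type residue, chain, position, mutant residue, terminated by a semicolon. The chain identifier and residue number in the usage note are hypothetical and must follow the numbering of the PDB file actually used.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def foldx_mutation(wild_type: str, chain: str, position: int, mutant: str) -> str:
    """Format one substitution as an individual-list entry, e.g. "YA371H;" (assumed format)."""
    return f"{wild_type}{chain}{position}{mutant};"


def write_individual_list(mutations: Sequence[str], directory: str = ".") -> Path:
    """Write one mutation entry per line to individual_list.txt and return its path."""
    path = Path(directory) / "individual_list.txt"
    path.write_text("\n".join(mutations) + "\n")
    return path


def repair_command(pdb_file: str, executable: str = "foldx") -> List[str]:
    """Command line for step 1 (RepairPDB); option names are assumptions."""
    return [executable, "--command=RepairPDB", f"--pdb={pdb_file}"]


def build_model_command(repaired_pdb: str, mutant_file: str, executable: str = "foldx") -> List[str]:
    """Command line for step 2 (BuildModel); option names are assumptions."""
    return [
        executable,
        "--command=BuildModel",
        f"--pdb={repaired_pdb}",
        f"--mutant-file={mutant_file}",
    ]


def run_pipeline(pdb_file: str, mutations: Sequence[str], executable: str = "foldx") -> None:
    """Run RepairPDB, then BuildModel, in the current working directory."""
    mutant_file = write_individual_list(mutations)
    subprocess.run(repair_command(pdb_file, executable), check=True)
    # ASSUMPTION: RepairPDB names its output "<stem>_Repair.pdb".
    repaired = f"{Path(pdb_file).stem}_Repair.pdb"
    subprocess.run(build_model_command(repaired, mutant_file.name, executable), check=True)
```

For example, `run_pipeline("1FBV.pdb", [foldx_mutation("Y", "A", 371, "H")])` would request the p.Y371H substitution in chain A, provided FoldX is installed and the structure file is present in the working directory.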

3.5 Predicting the Effects of Mutations on Protein–Protein Binding Affinity

A crucial prerequisite for proper biological function is a protein’s ability to establish highly selective interactions with macromolecular partners. A missense mutation that affects protein interactions [38–40] may significantly perturb or completely abolish protein function, potentially leading to disease. Typically, the change in binding free energy (ΔΔGbind) is used to quantify the magnitude of mutational effects on protein–protein interactions (Fig. 4).

$$\Delta\Delta G_{\mathrm{bind}} = \Delta G_{\mathrm{bind}}^{\mathrm{mut}} - \Delta G_{\mathrm{bind}}^{\mathrm{WT}} \tag{2}$$

The binding energy is calculated as a difference between the free energies of a complex AB and unbound proteins A and B:

$$\Delta G_{\mathrm{bind}} = G_{AB} - G_{A} - G_{B} \tag{3}$$
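As a worked illustration of Eqs. 2 and 3, the sketch below computes the binding free energy of a wild-type and a mutant complex and then their difference. The function names and all energy values are hypothetical; only the arithmetic follows the two equations.

```python
def binding_free_energy(g_complex, g_a, g_b):
    """Eq. 3: free energy of complex AB minus those of unbound A and B."""
    return g_complex - g_a - g_b


def ddg_bind(dg_mut, dg_wt):
    """Eq. 2: change in binding free energy upon mutation."""
    return dg_mut - dg_wt


if __name__ == "__main__":
    # Hypothetical free energies in kcal/mol, for illustration only.
    dg_wt = binding_free_energy(g_complex=-250.0, g_a=-120.0, g_b=-118.0)
    dg_mut = binding_free_energy(g_complex=-247.5, g_a=-119.5, g_b=-117.5)
    print(f"dG_bind (WT)  = {dg_wt:.1f} kcal/mol")
    print(f"dG_bind (mut) = {dg_mut:.1f} kcal/mol")
    print(f"ddG_bind      = {ddg_bind(dg_mut, dg_wt):.1f} kcal/mol")
```

With this sign convention, and with favorable binding corresponding to a negative ΔGbind, a positive ΔΔGbind indicates that the mutation weakens binding.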

There are very few methods that estimate actual ΔΔGbind values, and these methods require all-atom or at least backbone atom coordinates of a wild-type and/or mutated protein. Some of the methods use coarse-grained predictors based on statistical or empirical potentials; others apply molecular mechanics force fields with different solvation models. For example, the molecular mechanics Poisson–Boltzmann surface area (MM-PBSA) method and its derivatives have been shown to yield very good agreement between predicted and experimental values, with correlation coefficients up to 0.69 [41]. For all methods, the right choices of minimization protocol, energy function, and solvation model are crucial for achieving reasonable prediction accuracy. In addition, prediction accuracy strongly depends on the type of mutation and its location in a protein complex. For example, if a residue is located on the protein–protein interface, its mutation might have a larger effect on protein–protein interaction and binding affinity than a non-interfacial mutation [41]. The locations of mutated sites can be mapped with the SPPIDER (http://sppider.cchmc.org/) [42] or Meta-PPISP (http://pipe.scs.fsu.edu/meta-ppisp.html) [43] servers. These servers are recommended by two assessments of computational methods for predicting protein–protein interaction sites [44, 45]. Users can also analyze structures and locations of mutations using the Chimera or VMD software.

Below is a step-by-step protocol reported in our previous paper [41] to predict the impact of mutations on binding affinity. This protocol combines molecular mechanics force fields with statistical (BeAtMuSiC) and empirical (FoldX) energy functions. All files are provided via the ftp site ftp://ftp.ncbi.nih.gov/pub/panch/Mutation_binding. An improved version is available from our MutaBind server https://www.ncbi.nlm.nih.gov/projects/mutabind/.

  1. Install software VMD, NAMD, and CHARMM.

  2. Download a structure for your protein of interest from the Protein Data Bank (PDB).

  3. Add hydrogen atoms, a rectangular box (10 Å) of water molecules, and Na+ and Cl− ions (ionic concentration of 150 mM) to the structure using VMD (“vmd.pgn”).

  4. Carry out 5000-step energy minimization with harmonic restraints (with the force constant of 5 kcal mol−1 Å−2) applied on the backbone atoms of all residues (“minimization1.conf”), followed by a 35,000-step energy minimization on the whole system (“minimization2.conf”) with NAMD program using the CHARMM27 force field.

  5. Introduce a mutation using “mutator” plugin of VMD software on the final minimized model from step 4.

  6. Run an additional 300-step minimization for the whole mutant structure (“minimization2.conf”).

  7. Run the CHARMM program using the last frame from step 4 (for the wild-type structure) and step 6 (for the mutant structure) to obtain the van der Waals interaction energy (ΔEvdw) and the polar solvation energy of the solute in water (ΔGsolv) for the wild type and the mutant, and the interface area (ΔSAmut) for the mutant (Runfile is “binding_energy.str”).

  8. Submit your structure from step 2 and a mutation to BeAtMuSiC webserver (Table 5) to obtain the binding affinity change ΔΔGBM.

  9. Run AnalyseComplex module of FoldX to obtain the binding affinity change ΔΔGFD. (http://foldxsuite.crg.eu/command/AnalyseComplex).

  10. Obtain the binding affinity changes using the following combination of energy terms from [41]:

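    $$\Delta\Delta G_{\mathrm{bind}} = \alpha\,\Delta\Delta E_{\mathrm{vdw}} + \beta\,\Delta\Delta G_{\mathrm{solv}} + \gamma\,\Delta SA_{\mathrm{mut}} + \varepsilon\,\Delta\Delta G_{\mathrm{BM}} + \lambda\,\Delta\Delta G_{\mathrm{FD}} + \delta$$

    where α = 0.122, β = 0.101, γ = 0.043, ε = 0.446, λ = 0.168, and δ = 1.326.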
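Step 10 is a plain linear combination, which the sketch below evaluates with the coefficients quoted above. The function and argument names are illustrative rather than taken from the distributed protocol files, and the inputs are assumed to be in the units produced by steps 7 to 9.

```python
# Coefficients of the linear model from step 10.
ALPHA = 0.122    # weight of ddE_vdw (van der Waals term, step 7)
BETA = 0.101     # weight of ddG_solv (polar solvation term, step 7)
GAMMA = 0.043    # weight of dSA_mut (interface area of the mutant, step 7)
EPSILON = 0.446  # weight of ddG_BM (BeAtMuSiC prediction, step 8)
LAMBDA = 0.168   # weight of ddG_FD (FoldX prediction, step 9)
DELTA = 1.326    # constant term


def predict_ddg_bind(dde_vdw, ddg_solv, dsa_mut, ddg_bm, ddg_fd):
    """Combine the five energy terms into a predicted ddG_bind."""
    return (ALPHA * dde_vdw
            + BETA * ddg_solv
            + GAMMA * dsa_mut
            + EPSILON * ddg_bm
            + LAMBDA * ddg_fd
            + DELTA)


if __name__ == "__main__":
    # Hypothetical input terms, for illustration only.
    value = predict_ddg_bind(dde_vdw=2.0, ddg_solv=-1.0, dsa_mut=5.0,
                             ddg_bm=1.2, ddg_fd=0.8)
    print(f"predicted ddG_bind = {value:.3f}")
```

Because BeAtMuSiC and FoldX carry the largest weights among the energy terms, errors in those two inputs propagate most strongly into the final estimate.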

Recently a new computational method MutaBind [46] was developed to evaluate the effects of mutations on protein–protein interactions (http://www.ncbi.nlm.nih.gov/projects/mutabind/). The MutaBind method uses molecular mechanics force fields, statistical potentials, and fast side-chain optimization algorithms. It maps mutations on a protein complex structure, calculates the associated changes in binding affinity, determines the deleterious effect of a mutation, estimates the confidence of this prediction, and produces a mutant structural model for download.

3.6 Assessing the Changes in Protein Conformations and Hydrogen Bond Networks Induced by Mutations

Proteins may adopt different conformations along the pathway of a biochemical reaction and their intrinsic flexibility and ability to sample alternative conformations are crucial for protein function. Mutations might shift the equilibrium between different conformations (Fig. 4) and as a result, the most populated conformation of a mutated protein can be different in structure, stability, and functional activity from the wild-type conformation. It is extremely difficult to model structural changes in a protein backbone produced by mutations and large conformational shifts can be predicted correctly only in a few cases. In fact, most algorithms discussed in the previous sections do not account for the backbone flexibility. If several conformations of the same protein are available in the structural databank, all of them ideally should be used to provide a complete picture of dynamical and energetic effects of mutations [20].

Mutations can either change the global conformation of an entire molecule or have a localized effect in a small region. In a recent study of the NFAT5 transcription factor [47], different mutations from the same DNA-binding loop were analyzed, and their effects on protein dynamics and DNA binding were shown to be drastically different although the mutated sites were located very close to each other in sequence and structure. Protein dynamics can be studied by performing molecular dynamics (MD) simulations using the NAMD [2], CHARMM [3], and Amber [48] MD packages. NAMD, for example, is fast and easy to use and can be applied with CHARMM or Amber force fields, whereas the VMD or CHARMM packages can be used to analyze the MD trajectories produced by NAMD.

Changes in structure may also be assessed through the analyses of hydrogen bond networks and their differences between mutant and wild-type proteins since hydrogen bonds are important in determining protein stability. A mutation disrupting hydrogen bonds might have a significant impact on protein conformation, stability, and dynamics (reviewed in [49], Fig. 4). Hydrogen bonds can be calculated using HBOND (http://caps.ncbs.res.in/iws/hbond.html) [50] or PIC servers (http://pic.mbu.iisc.ernet.in/) [51] and visualized by Chimera.
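Once hydrogen bonds have been listed for the wild-type and mutant structures, comparing the two networks reduces to a set difference. The sketch below assumes each bond has already been parsed into a (donor, acceptor) pair of atom identifiers. The "chain:residue number:atom name" format and the example bonds are hypothetical and do not reflect the output format of HBOND, PIC, or Chimera.

```python
def compare_hbond_networks(wild_type_bonds, mutant_bonds):
    """Return hydrogen bonds lost and gained upon mutation.

    Each bond is a (donor, acceptor) tuple of atom identifiers.
    """
    wt = set(wild_type_bonds)
    mut = set(mutant_bonds)
    return {
        "lost": sorted(wt - mut),      # present only in the wild type
        "gained": sorted(mut - wt),    # present only in the mutant
        "retained": sorted(wt & mut),  # present in both structures
    }


if __name__ == "__main__":
    # Hypothetical bonds, identified as chain:residue_number:atom_name.
    wild_type = [("A:45:N", "A:41:O"), ("A:72:OG", "A:45:OD1")]
    mutant = [("A:45:N", "A:41:O"), ("A:90:NZ", "A:45:O")]
    result = compare_hbond_networks(wild_type, mutant)
    for label, bonds in result.items():
        print(label, bonds)
```

Identifying atoms by residue number rather than residue name keeps backbone bonds of the mutated residue comparable between the two structures, since the residue name changes upon mutation.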

Below is a step-by-step protocol to assess the conformational changes induced by mutations:

  1. Download the structure and introduce a mutation using VMD (refer to steps 2 and 5 from Subheading 3.5).

  2. Build the model systems with VMD (Refer to step 3 from Subheading 3.5).

  3. Carry out the energy minimizations (Refer to step 4 from Subheading 3.5). The number of minimization steps can be chosen based on the size of system.

  4. Heat the system to 300 K over 300 ps with harmonic constraints applied to protein backbone atoms using NAMD (“heat.conf”).

  5. Perform unconstrained MD simulation on the system with NAMD (“md.conf”).

  6. Load MD trajectories into the VMD software to monitor the conformational changes and calculate the root-mean-square deviation (RMSD) between the wild-type and mutant structures.

    Assessing the effects of mutations on hydrogen bond networks using Chimera.

  7. Load your protein structure of interest into Chimera.

  8. Select residues of interest and input “findhbond selRestrict both distSlop 0.35 angleSlop 60.0 saveFile filename” in the command line. The hydrogen bonds will be shown on the screen and details will be saved in the file “filename”. One can adjust the distance (distSlop) and angle (angleSlop) parameters in the definition of hydrogen bonds.
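For the RMSD comparison in step 6, the underlying calculation can be sketched as follows. This is a minimal illustration, not the VMD implementation: it assumes the two coordinate sets contain the same atoms in the same order (for example Cα atoms, which exist in both wild type and mutant) and have already been superimposed. The coordinates shown are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-aligned coordinate sets.

    Each set is a sequence of (x, y, z) tuples with matching atom order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Hypothetical C-alpha coordinates in angstroms, for illustration only.
    wild_type = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    mutant = [(0.0, 0.0, 0.0), (3.8, 0.5, 0.0), (7.6, 0.0, 1.0)]
    print(f"RMSD = {rmsd(wild_type, mutant):.2f} angstroms")
```

Without prior superposition the value also reflects rigid-body translation and rotation, so the alignment step should not be skipped when comparing frames from different trajectories.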

    What type of mutation causes a nonfunctional protein?

    Nonsense mutation. A nonsense mutation occurs in DNA when a sequence change gives rise to a stop codon rather than a codon specifying an amino acid. The presence of the new stop codon results in the production of a shortened protein that is likely nonfunctional.

    Which type of mutation is most likely to result in a nonfunctional protein?

    A frameshift mutation is one that will most likely cause a change in the protein's structure and function.

    Which mutation may result in a non

    Missense Mutation. Missense mutations can also be benign and change an amino acid in a protein without altering its function.

    Why is a frameshift mutation more likely to result in a nonfunctional protein?

    A frameshift mutation creates an entirely new reading frame with completely different nucleotide triplets, or codons, downstream of the mutation. The result is most likely an entirely changed amino acid sequence and hence a nonfunctional protein.