\begin{document}
\maketitle
\section{Introduction} Single nucleotide variants (SNVs) are the most common form of intra-species genetic variation. Genome projects aim to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype (Durbin et al., 2010) and the migration patterns of ancient humans. SNVs are now used as markers in genome-wide association studies (GWAS), which have identified over 3300 common SNVs that correlate with disease phenotypes.
Discovering the functional impact of SNVs that contribute to disease susceptibility and drug sensitivity is one of the main goals of modern genetics and genomics.
From a biological perspective, SNVs can have many functional impacts. An SNV may affect the transcriptional machinery of the cell if it falls in a region of DNA that contains signals recognized by transcription factors, or it may affect splicing if it occurs at a splice site or at a site where exonic or intronic splicing enhancers or repressors bind.
SNVs that alter protein sequences can also interfere with the conformation of the tertiary structure.
Bioinformatics assesses the impact of SNVs indirectly, through proxies. A biologist interested in the effect of an SNV on gene transcription might perform site-directed mutagenesis on genomic DNA, transfer the mutated DNA into cell culture, and use readouts of transcriptional activity to measure changes relative to wild type. In contrast, bioinformatics approaches typically involve computational analysis of the DNA sequence around the SNV. Most of this work has concentrated on SNVs in protein-coding regions that change the amino acid sequence, known as coding SNVs (cSNVs) or non-synonymous SNVs (nsSNVs), and on their relation to human health.
However, most SNVs that are distributed differently between cases and controls in GWAS are not located in protein-coding exons.
New methods are therefore needed to predict which of these putative regulatory SNVs (known as rSNVs) may be consequential.
\section{Classical Approaches} In genetics and cancer research there is a move to use SNV rather than SNP (single nucleotide polymorphism), because SNV is a broader term that encompasses both common and rare variants.
\section{Impact of cSNVs}
\textbf{Properties of amino acid residue substitution:}\\
The properties of an amino acid residue substitution can help predict the effect of a cSNV. One such property is the evolutionary distance between the pair of amino acids involved.
PAM matrices: this approach approximates the distance from the substitutions observed between pairs of amino acids in closely related proteins.
BLOSUM matrices: this approach also includes distantly related species, but considers only highly conserved positions of proteins.
Both approaches use raw mutation rates to calculate a score for each amino acid substitution, reflecting how likely the substitution is to arise along an evolutionary trajectory over time rather than by chance.
A conservative substitution is less likely to be disruptive, whereas a non-conservative substitution is more likely to be deleterious.
Other approaches consider additional properties of amino acids, such as biophysical properties, change in volume, hydrophobicity, net charge, packing density and solvent accessibility. These properties correlate with the functional impact of cSNVs.
Grantham hypothesized that the biophysical importance of an amino acid substitution could be quantified in three-dimensional space and represented by a Euclidean distance, now known as the Grantham distance.
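As a sketch of this idea, assume that each amino acid $i$ is described by three properties, written here as composition $c_i$, polarity $p_i$ and molecular volume $v_i$, and that $\alpha$, $\beta$ and $\gamma$ are weights that put the three properties on a comparable scale (all of these symbols are introduced here for illustration). The distance between amino acids $i$ and $j$ then takes the form of a weighted Euclidean distance:
\[
D_{ij} = \sqrt{\alpha (c_i - c_j)^2 + \beta (p_i - p_j)^2 + \gamma (v_i - v_j)^2}
\]
A larger $D_{ij}$ indicates a more radical substitution, and $D_{ij} = 0$ only when the two amino acids agree in all three properties.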
However, these metrics alone are not sufficient to predict the impact of cSNVs reliably.\\
\textbf{The evolutionary history of an amino acid position:}\\
BLOSUM and PAM scores estimate the evolutionary distance separating a pair of amino acids.
It is also useful to consider the amino acid composition at equivalent positions across a protein family, because functionally or structurally important positions will not tolerate a variety of amino acids.
Shannon entropy: quantifies how surprised we are by the distribution of all amino acids in a column of a multiple sequence alignment. Highly conserved columns present few surprises, while columns with little conservation present more surprises.
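A common way to write this, assuming $p_a$ denotes the observed frequency of amino acid $a$ in the column (notation introduced here for illustration), is
\[
H = -\sum_{a=1}^{20} p_a \log_2 p_a ,
\]
with the convention that terms with $p_a = 0$ contribute zero. A perfectly conserved column gives $H = 0$, while a column in which all 20 amino acids are equally frequent gives the maximum value $H = \log_2 20 \approx 4.32$ bits.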
Relative entropy: compares the amino acid distribution observed in a column with the background distribution of amino acids.
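As a sketch, assuming $p_a$ is the frequency of amino acid $a$ in the column and $q_a$ is its background frequency (both symbols are introduced here for illustration), the relative entropy of the column can be written as
\[
D(p \,\|\, q) = \sum_{a=1}^{20} p_a \log_2 \frac{p_a}{q_a} .
\]
It is zero when the column matches the background distribution exactly, and it grows as the column becomes dominated by amino acids that are rare in the background.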
Many scores incorporate both properties of amino acid substitution and evolutionary conservation.
The SIFT score: a weighted average of the frequency with which the variant amino acid residue appears in the multiple alignment column and an estimate of unobserved frequencies via Dirichlet mixture pseudocounts (Sj\"olander et al., 1996).
The PSIC score: considers the difference between the likelihood of the reference and variant amino acids at an alignment column, after encoding the alignment in a Position-Specific Scoring Matrix (PSSM).
The A-GVGD (Align-GVGD) score: a position-specific variant of the Grantham distance, in which the Grantham Variation (GV) is computed by replacing each pair of components representing composition, polarity and volume with the maximum and minimum values in the alignment column.
The Grantham Deviation (GD): measures the extent to which a variant amino acid deviates from the range of variation seen in the column, giving an estimate of its violation of the evolutionary constraints on that protein position.
The MAPP score: summarizes an alignment column statistically by constructing a phylogenetic tree and weighting each sequence by tree topology and branch lengths.
The mean and variance of amino acid physicochemical properties in the summarized column are used to estimate position-specific constraints on amino acid substitution in a biologically meaningful way.
Principal component analysis (PCA): transforms the properties into decorrelated components, which are used to generate an integrated score that measures constraint violations with respect to all of the amino acid properties.
\textbf{Sequence-function relationships:}
Resources such as the UniProtKB database hold information about sequence-function relationships that can be useful for assessing whether a cSNV is functional. UniProtKB maintains a feature table for each curated protein, annotating regions and specific sites of interest. These features can be used to identify whether a cSNV occurs in a region that is sensitive to amino acid substitution, such as biologically important sequence motifs, active site residues, metal-binding sites, sites of post-translational modification and lipid-binding sites.
In some cases, the results of mutational functional testing at a position are included as a feature.
\textbf{Structure-function relationships}
If a cSNV can be mapped onto an experimentally determined protein structure or a high-quality homology model, it is possible to compute a number of properties that are useful in predicting its functional effect.\\
Solvent accessibility: the extent to which water can contact a residue of the protein. This property is considered one of the strongest predictors of the functional impact of a cSNV, because a substitution in the hydrophobic core of a soluble protein can disrupt thermodynamic stability.\\
Structural modeling of the mutant: used to assess whether a cSNV induces backbone strain, causes over-packing, creates cavities, or affects key pairwise residue interactions.
X-ray crystal structures may capture a protein in complex with an interacting protein partner or with small-molecule, nucleotide or peptide ligands.
Thus, being able to locate a cSNV within the protein structure makes it possible to assess, in some cases, whether the change occurs near a catalytic site or binding site, or at a domain-domain interface in a protein complex.
In other cases, electrostatic analysis of a protein surface or model can reveal highly charged patches which, if disrupted by a cSNV, may affect binding interactions.
\section{Impact of rSNVs}
SNVs can have regulatory impact in many ways, ranging from disrupting chromatin structure to causing the loss of a post-translational modification site.\\
\textbf{Transcription}
SNVs can interfere with transcriptional regulation by disrupting transcription factor binding sites (TFBSs). In the human genome, most of these sites are predicted using PSSMs from database resources such as JASPAR and TRANSFAC. Each PSSM describes a statistical profile of the sequences bound by a given transcription factor, derived from experimental studies such as ChIP-seq.
PSSMs allow researchers to predict TFBS profiles from limited sets of observations by assuming that each position in the site is independent.
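Under this independence assumption, the score of a candidate site is simply a sum over positions. As a sketch, assume $f_{i,b}$ is the frequency of nucleotide $b$ at position $i$ of the profile and $q_b$ is its background frequency (symbols introduced here for illustration, with pseudocounts assumed so that no frequency is zero). A site $s = s_1 s_2 \ldots s_L$ of length $L$ can then be scored as
\[
S(s) = \sum_{i=1}^{L} \log_2 \frac{f_{i,s_i}}{q_{s_i}} .
\]
An SNV at position $i$ that replaces nucleotide $r$ with nucleotide $v$ changes only one term of the sum, so the change in score is
\[
\Delta S = \log_2 \frac{f_{i,v}}{q_v} - \log_2 \frac{f_{i,r}}{q_r} .
\]
A strongly negative $\Delta S$ suggests that the variant weakens the predicted site, although the score alone does not establish a functional effect.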
Some positions in a TFBS are highly conserved, reflecting transcription factors that depend on a specific nucleotide for proper binding, while others are more divergent. This is captured by the relative entropy of each column in the PSSM.
An SNV that alters a conserved position is more likely to be deleterious than one that alters a divergent position. A number of transcription factors, such as CTCF, act as chromatin modifiers. In such cases SNVs can be especially consequential, because these factors regulate not simply individual genes but large regions of chromatin.
Loss or gain of a predicted TFBS might not be consequential, for several reasons. In particular, PSSMs tend to be low-specificity predictors because transcription factor binding motifs are short and degenerate.
\textbf{Pre-mRNA splicing}
SNVs that affect splicing tend to be consequential, and the associated disease can often be traced to a single point mutation. SNVs can alter splicing by disrupting an existing splice site, by creating a new one inside an exon, or by adding or removing splicing regulatory motifs such as exonic splicing enhancers (ESEs) and exonic splicing silencers (ESSs). SNVs in exons can also introduce premature termination codons, which lead to little or no protein. Analysis of the splice sites close to an SNV is therefore a key component of SNV function prediction.
PSSMs are considered effective for capturing the importance of each splice site position. SNVs that introduce premature termination codons have a high likelihood of being consequential.
\textbf{MicroRNA binding}
SNVs can also affect regulation by altering microRNA (miRNA) binding sites. miRNAs perform a diversity of functions, but most are implicated in mRNA silencing through translational repression or cleavage.
SNVs in the seed-matching region of the binding site have a greater impact than SNVs elsewhere in the site, although the overall impact of an SNV depends on the entire binding site.
\textbf{Altering post-translational modification sites}
The activity of many eukaryotic proteins is regulated through post-translational modification (PTM). PTM sites are identified by:\\
1) experimental measurement\\
2) prediction\\
Recent advances in mass spectrometry proteomics have dramatically improved the detection of PTM sites.
When evaluating PTM site annotations, the onus is still on the researcher to understand how a prediction was derived and what the limitations of the method are.
\section{Bioinformatics predictors under the hood}
\textbf{Single vs. multiple feature strategies:}
Some functional predictors of cSNVs and rSNVs depend on a single feature, but the more popular approach is data integration, which combines multiple features into a single predictive score.
If two features are highly correlated, there is probably not much benefit in using both of them, whereas uncorrelated or independent features are expected to increase predictor accuracy. It makes sense to design a predictor around a set of highly informative and independent features, but researchers usually take an empirical approach to ascertain which features are useful. The best features are those that distinguish functional SNVs from neutral SNVs, which have no functional impact. Intelligent feature selection therefore requires a collection of SNVs from both categories. If the collection is large enough, it will be sufficiently representative of this variety. If it is too small or contaminated by mislabeled SNVs, it will not be possible to ascertain whether a selected feature is useful.
\textbf{Benchmark sets:}
Researchers use two approaches to assemble collections of functional and neutral SNVs: functional assay results and data mining.
Functional SNVs associated with disease, typically Mendelian or monogenic disease, can be collected from curated databases such as SwissProt Variant, OMIM and HGMD, or obtained from clinical or functional studies.
Early predictors took a different approach and used the results of saturation mutagenesis experiments on bacterial and viral proteins as benchmarks. These benchmark sets assume that there is a relationship between SNVs and their phenotypic effect.
In such experimental benchmarks it is assumed that the measured impact on a single molecular function is directly coupled with disease.
Given these assumptions, researchers should remain aware of the underlying uncertainties.
\textbf{Supervised learning:}
Most data-driven approaches to predicting SNV function use collections of both functional and neutral SNVs to assess the most relevant predictive features and to train a classification algorithm, so the benchmark set also serves as a training set.
Supervised statistical learning algorithms are the most commonly used. Each SNV is represented by multiple features and a class label.
These algorithms detect patterns associated with each class and learn a decision rule, which is subsequently applied to SNVs whose class membership is unknown.
A supervised learner is considered successful if the decision rule learned in the training phase generalizes, that is, if it correctly predicts the class of SNVs that were not in the training set.
\section{BUYER BEWARE}
A wide range of bioinformatics methods for assessing SNVs is available via easy-to-use web interfaces, so it makes sense to decide which of them is a good choice for a particular purpose. The problem is that they are often applied as black boxes, without a solid understanding of how they work, which is a dangerous strategy from both a practical and a scientific perspective. It is recommended to study the underlying methods carefully, ideally through the scientific publications that describe them, because these explain the assumptions made by each method.
Any method that uses protein sequence alignments to evaluate cSNVs will be biased by the sequences that are included in the alignment.
Identifying biologically important conservation signals requires a sample of both closely and distantly related sequences.
If the alignment contains sequences that are too distantly related to the SNV-containing sequence, it may contain residues that would be deleterious if substituted into the equivalent position in human.
Conversely, if it contains only closely related sequences, almost all positions will appear to be conserved, whether or not they are functionally important.
The SIFT algorithm is designed to avoid this pitfall by selecting a diverse set of homologs.
If a prediction method involves feature selection or supervised machine learning, information leaks introduced during testing may yield an overly optimistic evaluation of performance.
\begin{figure}[h!]
\centering
\includegraphics[scale=0.8]{f1_large.JPG}
\caption{Flow chart for informed use of SNV function prediction tools}
\label{threadsVsSync}
\end{figure}
\section{CONCLUSION}
SNV prioritization has become a high priority in the age of personal genomics (a branch of genomics).
This task is now made easy by SNV meta-servers, which are essentially black boxes that automate the execution of multiple assessments to identify functional SNVs. Anyone who uses black-box tools should understand the underlying methods and their limitations, some of which are shared by all of these methods. For example, current ESS predictors are based on only one family of splicing factors,
but there is no reason to believe that these proteins represent the complete universe of ESSs.
New technologies such as ChIP-seq represent a significant advance in evaluating the impact of SNVs on DNA binding. This technology will facilitate research on the impact of SNVs on DNA methylation, chromatin and transcription factor binding, and RNA processing (through CLIP-seq).
\nocite{*}
\bibliographystyle{abbrv}
\bibliography{references}
\end{document}
@article{durbin2010map,
  title={A map of human genome variation from population-scale sequencing},
  author={Durbin, R. M. and Abecasis, G. R. and Altshuler, D. L. and Auton, A. and Brooks, L. D. and Gibbs, R. A. and Hurles, M. E. and McVean, G. A.},
  journal={Nature},
  volume={467},
  number={7319},
  pages={1061--1073},
  year={2010},
  url={http://www.1000genomes.org/sites/1000genomes.org/files/docs/nature09534.pdf}
}

@article{cline2011using,
  title={Using bioinformatics to predict the functional impact of {SNVs}},
  author={Cline, Melissa and Karchin, Rachel},
  journal={Bioinformatics},
  volume={27},
  number={4},
  pages={441--448},
  year={2011},
  url={http://bioinformatics.oxfordjournals.org/content/27/4/441.full.pdf+html},
  note={Associate Editor: Jonathan Wren}
}