Shannon entropy analysis (Shannon, 1948) is possibly the most sensitive
tool to estimate the diversity
of a system.
For a multiple protein sequence alignment,
the Shannon entropy (H) at each position is computed as follows:
$$H = -\sum_{i=1}^{M} P_i \log_2 P_i$$
where P_i is the fraction of residues of amino acid type
i, and M is the number of amino acid types (20).
H ranges from 0 (only one residue is present at that position) to 4.322
(all 20 residues are equally represented at that position). Typically,
positions with H > 2.0 are considered variable, whereas those with H < 2.0
are considered conserved. Highly conserved positions are those with H < 1.0 (Litwin and Jores, 1992).
However, a minimum number of sequences (~100) is required for H to describe the
diversity of a protein family.
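The per-position calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the PVS implementation: the function names are hypothetical, and ignoring gap characters ('-' and '.') is an assumption, since the treatment of gaps is not described here.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (base 2) of one alignment column given as a string."""
    # Assumption: gap characters are ignored when counting residues.
    residues = [c for c in column.upper() if c not in "-."]
    total = len(residues)
    if total == 0:
        return 0.0
    counts = Counter(residues)
    # Sum of p * log2(1/p), which equals -sum(p * log2(p)).
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def alignment_entropy(sequences):
    """Entropy of every column of equally long aligned sequences."""
    return [shannon_entropy("".join(col)) for col in zip(*sequences)]

# Toy alignment with four short sequences.
msa = ["SHSMRYF", "SHSMRYF", "PHSLRYF", "PHSMRYF"]
print([round(h, 3) for h in alignment_entropy(msa)])
```

A fully conserved column gives H = 0, and a column in which all 20 amino acids are equally represented gives log2(20), which is about 4.322.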
Simpson Diversity Index
The Simpson index is another diversity index, calculated from the proportions of the different types observed. It is based on the chance that two individuals
sampled at random and with replacement from
a community will be from the same species; in an alignment, the "species" are the residue types found at a site. The value of this index ranges between 0 and 1;
the greater the value, the greater the sample diversity. The index is
computed from the following quantities:
n_i is the number of residues of type i, N is the total number of residues counted, and S is the number of different symbol types per site.
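The formula itself is not reproduced in this text, so the sketch below is only an illustration under a stated assumption: it uses the common count-based complement form, 1 - sum over i = 1..S of n_i(n_i - 1) / (N(N - 1)), which matches the symbols n_i, N and S defined above and grows with diversity. PVS may use a different variant (for example, one based on (n_i / N)^2). The function name and the choice to ignore gap characters are also assumptions.

```python
from collections import Counter

def simpson_diversity(column):
    """Simpson diversity of one alignment column (assumed complement form)."""
    # Assumption: gap characters are ignored when counting residues.
    residues = [c for c in column.upper() if c not in "-."]
    total = len(residues)  # N
    if total < 2:
        return 0.0
    counts = Counter(residues)  # n_i for each of the S symbol types
    same_type = sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))
    # Assumed form: 1 minus the chance that two sampled residues share a type.
    return 1.0 - same_type

print(simpson_diversity("AAAA"))
print(simpson_diversity("AABB"))
```

Under this assumed form, a fully conserved column scores 0 and a column in which every residue differs scores 1.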
Wu-Kabat Variability Coefficient
The Wu-Kabat variability coefficient is a well-established descriptor of the susceptibility of
an amino acid position to evolutionary replacements (Kabat et al., 1977). It highlights stretches of accentuated
amino acid variation. The variability coefficient is computed using the following formula:
$$\text{Variability} = \frac{N \cdot k}{n}$$
where N is the number of sequences in the alignment, k is the number of different amino acids at a given position,
and n is the number of times that the most common amino acid at that position is present.
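As a rough illustration of the coefficient, the following Python sketch computes N * k / n for a single alignment column. The function name is hypothetical, and counting only non-gap residues towards N is an assumption, since gap handling is not described here.

```python
from collections import Counter

def wu_kabat(column):
    """Wu-Kabat variability (N * k / n) of one alignment column."""
    # Assumption: gap characters are ignored, so N counts residues only.
    residues = [c for c in column.upper() if c not in "-."]
    total = len(residues)          # N
    if total == 0:
        return 0.0
    counts = Counter(residues)
    k = len(counts)                # number of different amino acids
    n = max(counts.values())       # occurrences of the most common one
    return total * k / n

print(wu_kabat("AAAA"))
print(wu_kabat("AAAB"))
```

A fully conserved column yields 1, the minimum value; with at least 20 sequences, the theoretical maximum is 400 (20 different amino acids, each equally frequent).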
When this option is selected, a multiple sequence alignment (MSA) must be provided.
The alignment can either be pasted or uploaded from a file.
This program accepts the following MSA formats: Clustal,
FASTA and GCG/PileUp.
Only the standard 20 amino acids should be
included in the alignment. If other sequence characters are included
(e.g. X), the server will return an error message.
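Before submitting an alignment, it can be checked locally for characters outside the standard 20 amino acids. The sketch below is a hypothetical helper, not part of PVS; treating '-' and '.' as permitted gap characters is an assumption.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
GAP_CHARS = set("-.")  # assumption: these gap symbols are tolerated

def find_invalid_characters(sequences):
    """Return (sequence index, column index, character) for each disallowed symbol."""
    problems = []
    for seq_index, seq in enumerate(sequences):
        for col_index, ch in enumerate(seq.upper()):
            if ch not in STANDARD_AA and ch not in GAP_CHARS:
                problems.append((seq_index, col_index, ch))
    return problems

# The second sequence contains an X at column index 2.
print(find_invalid_characters(["SHSMRYF", "SHXMR-F"]))
```

An empty list means that every sequence contains only standard residues and the assumed gap characters.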
A typical example of Clustal alignment is the following:
CLUSTAL W (1.81) multiple sequence alignment
.**:*** *::* ** **::: *****:. ******** . * *** *:
:*: * : * * .* ***** *** * : **::*.
The same multiple sequence alignment in FASTA Format:
An example of GCG/PileUp alignment would be:
MSF: 95 Type: P Check: 5477 ..
Name: hla_a68w_1HSB oo Len: 95 Check: 5515 Weight: 0.0
Name: hla_a0201_1DUY oo Len: 95 Check: 4661 Weight: 10.0
Name: hla_b3501_1A1N oo Len: 95 Check: 4585 Weight: 10.0
Name: hla_b5301_1A1M oo Len: 95 Check: 4402 Weight: 10.0
Name: hla_b5101_1E27 oo Len: 95 Check: 4791 Weight: 10.0
Name: hla_b2701_1HSA oo Len: 95 Check: 3347 Weight: 10.0
Name: hla_cw3_1EFX oo Len: 95 Check: 4868 Weight: 10.0
Name: hla-cw4_1IM9 oo Len: 95 Check: 4736 Weight: 10.0
Name: mkb_2vaa oo Len: 95 Check: 4517 Weight: 10.0
Name: db-1BZ9 oo Len: 95 Check: 4055 Weight: 10.0
hla_a68w_1HSB SHSMRYFYTS VSRPGRGEPR FIAVGYVDDT QFVRFDSDAA SQRMEPRAPW
hla_a0201_1DUY SHSMRYFFTS VSRPGRGEPR FIAVGYVDDT QFVRFDSDAA SQRMEPRAPW
hla_b3501_1A1N SHSMRYFYTA MSRPGRGEPR FIAVGYVDDT QFVRFDSDAA SPRTEPRPPW
hla_b5301_1A1M SHSMRYFYTA MSRPGRGEPR FIAVGYVDDT QFVRFDSDAA SPRTEPRPPW
hla_b5101_1E27 SHSMRYFYTA MSRPGRGEPR FIAVGYVDDT QFVRFDSDAA SPRTEPRAPW
hla_b2701_1HSA SHSMRYFHTS VSRPGRGEPR FITVGYVDDT LFVRFDSDAA SPREEPRAPW
hla_cw3_1EFX SHSMRYFYTA VSRPGRGEPH FIAVGYVDDT QFVRFDSDAA SPRGEPRAPW
hla-cw4_1IM9 SHSMRYFSTS VSWPGRGEPR FIAVGYVDDT QFVRFDSDAA SPRGEPREPW
mkb_2vaa PHSLRYFVTA VSRPGLGEPR YMEVGYVDDT EFVRFDSDAE NPRYEPRARW
db-1BZ9 PHSMRYFETA VSRPGLEEPR YISVGYVDNK EFVRFDSDAE NPRYEPRAPW
hla_a68w_1HSB IRNTRNVKAQ SQTDRVDLGT LRGYYNQSEA GSHTIQMMYG CDVGS
hla_a0201_1DUY IGETRKVKAH SQTHRVDLGT LRGYYNQSEA GSHTVQRMYG CDVGS
hla_b3501_1A1N IRNTQIFKTN TQTYRESLRN LRGYYNQSEA GSHIIQRMYG CDLGP
hla_b5301_1A1M IRNTQIFKTN TQTYRENLRI ALRYYNQSEA GSHIIQRMYG CDLGP
hla_b5101_1E27 IRNTQIFKTN TQTYRENLRI ALRYYNQSEA GSHTWQTMYG CDVGP
hla_b2701_1HSA IRETQICKAK AQTDREDLRT LLRYYNQSEA GSHTLQNMYG CDVGP
hla_cw3_1EFX VRETQKYKRQ AQTDRVSLRN LRGYYNQSEA GSHIIQRMYG CDVGP
hla-cw4_1IM9 VRETQKYKRQ AQADRVNLRK LRGYYNQSED GSHTLQRMFG CDLGP
mkb_2vaa MRETQKAKGN EQSFRVDLRT LLGYYNQSKG GSHTIQVISG CEVGS
db-1BZ9 MRETQKAKGQ EQWFRVSLRN LLGYYNQSAG GSHTLQQMSG CDLGS
When this option is selected, a PDB file or a PDB code must be provided. A PDB file is a text file containing
the three-dimensional (3D) coordinates of each atom in a protein. In addition to the coordinates, a PDB file may contain other information
about the protein and the experimental settings used to obtain the 3D structure. PDB files are deposited in the
Brookhaven database under a four-character code. An example
PDB file is one that corresponds to the structure of the biotinyl
domain of acetyl-coenzyme A.
For the PDB file input option, PVS will generate a multiple sequence
alignment from the sequence in the PDB file. Additionally, if a chain identifier is given, the
program will select that chain from the PDB file. When no chain is provided, it will select the
first chain by default.
The reference sequence can either be a consensus sequence or the first sequence in the alignment.
The second choice is particularly useful if the user has additional information on a given sequence,
and wants to set it as the standard. By default, the consensus sequence is selected.
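How PVS derives the consensus sequence is not specified here. A common approach, shown below purely as an assumption, is to take the most frequent residue in each column. The function name, the tie-breaking rule (first residue encountered) and the handling of gap characters are all hypothetical.

```python
from collections import Counter

def consensus_sequence(sequences):
    """Most frequent residue per column of equally long aligned sequences."""
    consensus = []
    for column in zip(*sequences):
        # Assumption: gap characters do not vote for the consensus.
        counts = Counter(c for c in column if c not in "-.")
        if not counts:
            consensus.append("-")  # column made of gaps only
            continue
        # most_common keeps first-seen order among residues with equal counts.
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)

print(consensus_sequence(["SHSMRYF", "SHSMRYF", "PHSLRYF"]))
```

Choosing the first sequence in the alignment instead simply means using `sequences[0]` as the reference.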
PVS can perform several output tasks: Plot variability,
Mask variability in sequence, Return Conserved fragments,
or Map structural variability.
By default, PVS will plot the sequence variability.
When Mask variability in sequence or Return Conserved fragments
is selected, a variability threshold
must be provided.
This parameter has to be set within the
range of 0 to 4.3 (default is 1.0) when Shannon is the selected variability method, and within the range of 0 to 1 (default is 0.46) when
Simpson is the selected variability method.
Positions with a variability value above the selected threshold
are filtered out. Positions with a variability value under the selected threshold are considered of low variability (highly conserved).
If both the Shannon and Simpson methods are selected, PVS applies the variability threshold as for Shannon.
Plot variability consists of a graph of the sequence variability plotted against the selected
reference sequence in the alignment, as shown below. However, if 'Map Structural Variability' has also been
selected and the input is 'protein alignment', the sequence variability will be plotted against the sequence in the provided PDB file.
Please note that you must select only 'Plot Variability' as an output task if you wish to have the sequence variability plotted
against the reference sequence in your protein alignment.
When several variability methods have been selected,
their graphs can be displayed by clicking on the method name.
Mask variability in sequence
This option masks in the selected reference sequence those residues with a variability
greater than or equal to the selected variability threshold. The variability-masked
sequence is returned in FASTA format (shown below). When the user clicks on the 'Run Epitope
Prediction' button, the returned FASTA sequence is sent to the RANKPEP algorithm for the
prediction of conserved T-cell epitopes.
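The masking step can be illustrated with a short Python sketch. It assumes one variability value per residue of the reference sequence. The mask symbol ('.'), the FASTA header name and the function names are hypothetical choices, since the symbol used by PVS is not stated here.

```python
def mask_variable_positions(reference, variability, threshold=1.0, mask_char="."):
    """Replace residues whose variability is >= threshold with mask_char."""
    if len(reference) != len(variability):
        raise ValueError("reference and variability must have the same length")
    return "".join(
        mask_char if value >= threshold else residue
        for residue, value in zip(reference, variability)
    )

def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)

reference = "SHSMRYF"
entropy = [1.0, 0.0, 0.0, 0.811, 0.0, 0.0, 2.1]
masked = mask_variable_positions(reference, entropy, threshold=1.0)
print(to_fasta("masked_reference", masked))
```

The default of 1.0 mirrors the default Shannon threshold; for Simpson values, a threshold in the range 0 to 1 (default 0.46) would be passed instead.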
This option identifies those fragments (minimum length selected by the user) in the selected reference
sequence consisting of consecutive residues whose variability is under the variability threshold.
These fragments are returned in a table sorted by their position in the sequence alignment.
Since sequence variability provides a means by which some pathogens escape the immune system,
this option and that of sequence variability masking are relevant for vaccine design.
It is important to note, however, that relevant antigenic regions can be composed of both conserved and
variable regions. Unfortunately, such regions will not appear in the conserved fragments output
if they do not have the minimum number of consecutive conserved residues selected by the user.
The minimum fragment length parameter sets the minimum length of the fragment. Each of the fragment
residues has an H that is under the threshold value.
Only the longest stretch of residues with H under the threshold is listed.
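The fragment search described above amounts to finding maximal runs of consecutive positions whose variability is under the threshold and keeping those that reach the minimum length. The Python sketch below is one way to do this; the function name and the 1-based, inclusive positions in the output are assumptions rather than the PVS output format.

```python
def conserved_fragments(reference, variability, threshold, min_length):
    """Return (start, end, fragment) for maximal runs with variability < threshold.

    Positions are 1-based and inclusive (an assumption for this sketch).
    """
    fragments = []
    start = None
    # A sentinel value above any threshold closes a run that reaches the end.
    for i, value in enumerate(list(variability) + [float("inf")]):
        if value < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                fragments.append((start + 1, i, reference[start:i]))
            start = None
    return fragments

reference = "SHSMRYFYTS"
entropy = [0.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0]
print(conserved_fragments(reference, entropy, threshold=1.0, min_length=3))
```

Because each run is extended as far as possible before it is reported, shorter sub-fragments of the same stretch are never listed separately, and the results come out already sorted by position.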
Map structural variability
This option maps the sequence variability onto a representative 3D-structure, using the PDB file
must be enabled in the browser. By default, the 3-D structure is shown as 'wireframe', although other
display options can be selected by the user. For instance, in the image below, the selected option is
'trace'. The 'Back to original mapping' button will restore the sequence variability mapping when the
'Conserved Fragments' option has been selected and the user has clicked on a fragment to locate it
on the PDB file.
Shannon, C. E. (1948) A mathematical theory of communication.
The Bell System Technical Journal, 27, 379-423 & 623-656.
Kabat, E. A., Wu, T. T., and Bilofsky, H. (1977) Unusual distribution of amino acids
in complementarity-determining (hypervariable) segments of heavy and light chains
of immunoglobulins and their possible roles in specificity of antibody
combining sites. J. Biol. Chem. 252, 6609-6616.
Litwin, S. and Jores, R. (1992) In Theoretical and Experimental Insights
into Immunology (edited by Perelson A. S. and Weisbuch G.), Springer-Verlag,
Last change: November 2007