The design of PSI-BLAST
An example
Notes on using PSI-BLAST
Adapted from:
Exercise
References
Introduction
Many functionally and evolutionarily important protein similarities are
recognizable only through comparison of three-dimensional structures [1,2].
When such structures are not available, patterns of conservation identified
from the alignment of related sequences can aid the recognition of distant
similarities. There is a large literature on the definition and construction
of these patterns, which have been variously called motifs, profiles,
position-specific score matrices, and Hidden Markov Models [3-11].
In essence, for each position in the derived pattern, every amino acid
is assigned a score. If a residue is highly conserved at a particular
position, that residue is assigned a high positive score, and others are
assigned high negative scores. At weakly conserved positions, all residues
receive scores near zero. Position-specific scores can also be assigned to
potential insertions and deletions [4,9,11].
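The scoring idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a gapless toy alignment, a uniform background frequency of 1/20 per residue, and a flat pseudocount. None of these choices is taken from PSI-BLAST, and the names build_profile and toy_alignment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions (illustrative only): equal-length gapless sequences,
    a uniform background of 1/20 per residue, and a flat pseudocount
    spread over the residues in proportion to the background.
    """
    background = 1.0 / len(AMINO_ACIDS)
    n_seqs = len(aligned_seqs)
    profile = []
    for column in zip(*aligned_seqs):
        scores = {}
        for aa in AMINO_ACIDS:
            # Observed count plus a small pseudocount, as a frequency.
            freq = (column.count(aa) + pseudocount * background) / (n_seqs + pseudocount)
            # Log-odds score in bits relative to the background.
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile

# Toy alignment of 20 two-residue sequences: column 0 is always K,
# column 1 contains every amino acid exactly once.
toy_alignment = ["K" + aa for aa in AMINO_ACIDS]
toy_profile = build_profile(toy_alignment)
```

In this toy case the conserved column gives K a positive score and every other residue a negative one, while the fully variable column gives all residues scores of essentially zero.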
The power of profile methods can be further enhanced through iteration of
the search procedure [6-8,10]. After a profile is run against a database,
new similar sequences can be detected. A new multiple alignment, which
includes these sequences, can be constructed, a new profile abstracted,
and a new database search performed. The procedure can be iterated as
often as desired or until convergence, when no new statistically significant
sequences are detected.
The design of PSI-BLAST
Iterated profile search methods have led to biologically important observations
but, for many years, were quite slow and generally did not provide precise
means for evaluating the significance of their results. This limited their
utility for systematic mining of the protein databases. The principal design
goals in developing the Position-Specific Iterated BLAST (PSI-BLAST) program
[10] were speed, simplicity and automatic operation. The procedure PSI-BLAST
uses can be summarized in five steps:
(1) PSI-BLAST takes as an input a single protein sequence and compares
it to a protein database, using the gapped BLAST program [10].
(2) The program constructs a multiple alignment, and then a profile,
from any significant local alignments found. The original query
sequence serves as a template for the multiple alignment and profile,
whose lengths are identical to that of the query. Different numbers
of sequences can be aligned in different template positions.
(3) The profile is compared to the protein database, again seeking
local alignments. After a few minor modifications, the BLAST
algorithm [10,12] can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local
alignments found. Because profile substitution scores are
constructed to a fixed scale [13], and gap scores remain independent
of position, the statistical theory and parameters for gapped BLAST
alignments [14] remain applicable to profile alignments [10].
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary
number of times or until convergence.
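The control flow of these five steps can be sketched as a short Python loop. This is a minimal sketch, not the program's actual code: search is a hypothetical callable standing in for steps (2) to (4), and toy_search is a made-up database used only to exercise the loop.

```python
def iterate_search(query, search, inclusion_threshold=0.001, max_iterations=5):
    """Iterate a profile search until no new sequences pass the threshold.

    `search(query, included_ids)` is a hypothetical callable: it rebuilds
    the profile from `included_ids`, searches the database, and returns
    a dict {sequence_id: e_value}.
    Returns (included ids, iterations run, converged flag).
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = search(query, included)
        passing = {seq_id for seq_id, e_value in hits.items()
                   if e_value <= inclusion_threshold}
        new = passing - included
        if not new:
            # Convergence: nothing new is statistically significant.
            return included, iteration, True
        included |= new
    return included, max_iterations, False


def toy_search(query, included):
    """Made-up database: 'distant' becomes significant only after
    'close' has been incorporated into the profile."""
    hits = {"close": 1e-30, "distant": 0.5}
    if "close" in included:
        hits["distant"] = 1e-5
    return hits


members, rounds, converged = iterate_search("MJ0414", toy_search)
```

With the toy database, the first pass recruits only the close relative, the second pass recruits the distant one, and the third pass finds nothing new and reports convergence.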
Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension
of BLAST; the results produced in iterative search steps are comparable to
those produced from the first pass. Unlike most profile-based search methods,
PSI-BLAST runs as one program, starting with a single protein sequence, and
the intermediate steps of multiple alignment and profile construction are
invisible to the user.
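For readers who want the statistics in concrete form, the standard local-alignment relation E = K m n exp(-lambda S) [13,14] can be written out as follows. This is a sketch only: it ignores the edge-effect corrections applied in practice, and the values of lambda and K below are illustrative assumptions rather than numbers taken from this text.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score
    in a search space of query_len * db_len (no edge-effect correction)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative parameters (assumed here, not taken from the text).
LAMBDA, K = 0.267, 0.041
m, n = 250, 100_000_000          # hypothetical query and database lengths

e_low = e_value(80, m, n, LAMBDA, K)    # weaker alignment
e_high = e_value(120, m, n, LAMBDA, K)  # stronger alignment
```

Because profile scores are built to the same scale, the same function applies unchanged to alignments found in later iterations; a higher raw score always yields a smaller E-value.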
PSI-BLAST uncovers many protein relationships missed by single-pass database-
search methods and has identified relationships that were previously detectable
only from information about the three-dimensional structure of the proteins
[10,15,16]. Here, we illustrate how to operate PSI-BLAST by using a comparison
of proteins from thermophilic archaea and bacteria as an example [17]. We
employ the WWW version of PSI-BLAST.
An example
Use Entrez to find the sequence of the uncharacterized protein MJ0414 from
Methanococcus jannaschii [18] in FASTA format, and paste it into the PSI-BLAST
Web page. At this point, you may immediately press the Submit Query button
or, instead, first tailor the search. For example, you may change the
substitution and gap costs, or the cutoff E-value that PSI-BLAST uses when
constructing a profile for the next iteration. This default E-value is the
rather conservative 0.001. Change it here to 0.01.
Examine the results of the program's initial gapped BLAST search. The only
significant hits are very strong ones to the query sequence itself, and to
uncharacterized proteins from three other archaea and the thermophilic
bacterium Aquifex aeolicus. However, iterating the search by using the
derived profile uncovers yeast DNA ligase II [19] with E-value 0.005,
which is moderately significant. If you have used 0.01 as the cutoff
E-value for recruitment of alignments into successive profiles, the ligase
sequence is included at this stage. If you left the cutoff E-value at 0.001,
PSI-BLAST reports convergence because no new sequences have alignments that
pass this threshold. Nevertheless, by checking the box next to the yeast
DNA ligase, you can force its inclusion in the construction of a PSI-BLAST
profile, and run another iteration. Because a ligase has been used in
constructing the query, the next iteration produces many highly significant
alignments that involve other DNA ligases.
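The effect of the cutoff and of the check boxes can be mimicked with a small Python sketch. Only the ligase E-value of 0.005 comes from the example above; the E-value given to the archaeal homolog, the sequence labels, and the function name select_for_profile are made up for illustration.

```python
def select_for_profile(hits, inclusion_threshold, forced=()):
    """Return the IDs that would be used to build the next profile.

    hits:   {sequence_id: e_value} from the current iteration
    forced: IDs the user ticked by hand, included regardless of E-value
    """
    chosen = [seq_id for seq_id, e in hits.items() if e <= inclusion_threshold]
    chosen += [seq_id for seq_id in forced
               if seq_id in hits and seq_id not in chosen]
    return chosen

# Hits after iterating with the first profile (archaeal value is invented).
hits = {"archaeal_homolog": 1e-40, "yeast_DNA_ligase": 0.005}

relaxed = select_for_profile(hits, 0.01)    # ligase passes the cutoff
default = select_for_profile(hits, 0.001)   # ligase excluded
manual = select_for_profile(hits, 0.001, forced=["yeast_DNA_ligase"])
```

Here relaxed and manual both contain the ligase, whereas default does not, which corresponds to the convergence message described above.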
How do we interpret these results? Once a single sequence from a highly
conserved family (here, the DNA ligases) is used in constructing a profile,
the rest of the family will almost certainly be retrieved (and have E-values
of high significance) in subsequent iterations. Impressive E-values for
sequences retrieved in later iterations depend upon the validity of earlier
inferences and therefore should not be taken as automatic proof of homology.
In the example considered here, the best evidence for a possible relationship
between the thermophile protein family and DNA ligases is the alignment
produced in the first PSI-BLAST iteration (E = 0.005). This should be taken
as a hint that requires corroboration. Fortunately, the PSI-BLAST alignment
of our uncharacterized protein and yeast DNA ligase here provides such
corroboration (Fig. 1). The best-conserved portions of the alignment
correspond perfectly to the set of conserved motifs identified in ATP-
dependent DNA ligases [20], including the catalytic lysine residue that
forms a covalent adduct with AMP [17].
Notes on using PSI-BLAST
The WWW version of PSI-BLAST requires the user to decide after each iteration
whether to continue. In some respects this is a limitation, but it has the
advantage that the user can hand-pick the sequences used for each profile
construction, regardless of E-value, by checking boxes next to the sequences'
descriptions. A stand-alone version of PSI-BLAST (obtainable from NCBI by
anonymous FTP at ftp://ncbi.nlm.nih.gov/blast/executables/) allows the user
to run the program for a chosen number of iterations or until convergence;
it also allows the user to save the profile produced and use it subsequently
to search another database.
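As a hedged illustration of a stand-alone run, the Python sketch below assembles a command line for the psiblast program of the current NCBI BLAST+ distribution. That is an assumption: the stand-alone program available when this text was written may have used a different name and different options. The file names are hypothetical, and the command is only built and printed, not executed.

```python
import shlex

def psiblast_command(query_fasta, database, iterations=3,
                     inclusion_ethresh=0.001, pssm_out=None):
    """Build an argument list for BLAST+ psiblast (assumed to be installed).

    iterations:        number of search rounds to run
    inclusion_ethresh: E-value cutoff for recruiting hits into the profile
    pssm_out:          optional file in which to save the final profile
    """
    cmd = ["psiblast",
           "-query", query_fasta,
           "-db", database,
           "-num_iterations", str(iterations),
           "-inclusion_ethresh", str(inclusion_ethresh)]
    if pssm_out:
        cmd += ["-out_pssm", pssm_out]
    return cmd

# Hypothetical file names for the MJ0414 example.
cmd = psiblast_command("mj0414.fasta", "nr", iterations=5,
                       inclusion_ethresh=0.01, pssm_out="mj0414.pssm")
print(shlex.join(cmd))
```

A saved profile can then be supplied to a later psiblast run against another database through the -in_pssm option in place of -query.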
PSI-BLAST is a powerful tool, but its use requires caution. The sources of
error are the same as for standard BLAST but are easily amplified by iteration.
The major source of deceptive alignments is the presence within proteins of
regions with highly biased amino acid composition [21]. If such a region
is included during production of a profile, otherwise unrelated sequences
containing similarly biased regions will probably creep in during subsequent
iterations, rendering the search nearly worthless. PSI-BLAST filters out
biased regions of query sequences by default, using the SEG program [21].
Because the SEG parameters have been set to avoid masking potentially
important regions, some bias may persist; PSI-BLAST can thus still generate
compositionally rooted artifacts. These cases usually can be identified by
inspection - especially when sequences that have a known bias, such as myosins
or collagens, are retrieved. SEG (ftp://ncbi.nlm.nih.gov/pub/seg/seg/) can be
used with parameters that eliminate nearly all biased regions [21], and the
user can apply locally other filtering procedures, such as COILS [22] (which
detects coiled-coil regions), before submitting the appropriately masked
sequence to PSI-BLAST.
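The idea of masking biased regions can be illustrated with a deliberately simplified Python sketch. It is not the SEG algorithm: it merely replaces every fixed-length window whose Shannon entropy falls below a cutoff with X characters. The window length, the cutoff, and the example sequence are assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def mask_low_complexity(sequence, window=12, min_entropy=2.2):
    """Replace residues in low-entropy windows with 'X'.

    Simplified stand-in for a compositional filter, not SEG itself:
    every window of `window` residues with entropy below `min_entropy`
    is masked in full.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

# Made-up sequence with a proline/glycine repeat between two ordinary flanks.
example = ("MKTAYIAKQRQISFVKSHFSRQ"
           + "PGPGPGPGPGPGPGPG"
           + "LEERLGLIEVQAPILSRVGDGT")
masked = mask_low_complexity(example)
```

The repeat is replaced by a run of X characters, together with a few flanking residues that share windows with it, while the start of the sequence is left untouched.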
Exercise
Use Entrez to find the C-terminal region (approximately 215 residues)
of human BRCA1 (SWISS-PROT accession number P38398) [23]. Search the
NR protein database with this sequence using PSI-BLAST. What do the Xs
in some alignments represent? Can the search be modified so that they
do not appear? How many PSI-BLAST iterations can be performed before
convergence? If dubious similarities pass the threshold for inclusion
in profile construction during a given iteration, try removing them and
check whether they reappear with significant similarity in the subsequent
iteration. For published analyses of some of these similarities, see
[10,24-26].
Adapted from:
Altschul, S.F. & Koonin, E.V. (1998) "Iterated profile searches with
PSI-BLAST - a tool for discovery in protein databases." Trends Biochem.
Sci. 23, 444-447.
References
[1] Holm, L. & Sander, C. (1997) "New structure - novel fold?" Structure
5:165-171. (PubMed)
[2] Brenner, S.E., Chothia, C. & Hubbard, T.J.P. (1998) "Assessing sequence
comparison methods with reliable structurally identified distant
evolutionary relationships." Proc. Natl. Acad. Sci. USA 95:6073-6078. (PubMed)
[3] Schneider, T.D., Stormo, G.D., Gold, L. & Ehrenfeucht, A. (1986)
"Information content of binding sites on nucleotide sequences."
J. Mol. Biol. 188:415-431. (PubMed)
[4] Gribskov, M., McLachlan, A.D. & Eisenberg, D. (1987) "Profile analysis:
detection of distantly related proteins." Proc. Natl. Acad. Sci. USA
84:4355-4358. (PubMed)
[5] Staden, R. (1988) "Methods to define and locate patterns of motifs in
sequences." Comput. Appl. Biosci. 4:53-60. (PubMed)
[6] Gribskov, M. (1992) "Translational initiation factor-IF-1 and
factor-EIF-2-alpha share an RNA-binding motif with prokaryotic ribosomal
protein-S1 and polynucleotide phosphorylase." Gene 119:107-111. (PubMed)
[7] Tatusov, R.L., Altschul, S.F. & Koonin, E.V. (1994) "Detection of
conserved segments in proteins: Iterative scanning of sequence databases
with alignment blocks." Proc. Natl. Acad. Sci. USA 91:12091-12095. (PubMed)
[8] Yi, T-M. & Lander, E.S. (1994) "Recognition of related proteins by
iterative template refinement (ITR)." Prot. Sci. 3:1315-1328. (PubMed)
[9] Bucher, P., Karplus, K., Moeri, N. & Hofmann, K. (1996) "A flexible motif
search technique based on generalized profiles." Comput. Chem. 20:3-23. (PubMed)
[10] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller,
W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs." Nucleic Acids Res. 25:3389-3402. (PubMed)
[11] Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998) "Biological
Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids."
Cambridge University Press, Cambridge, UK.
[12] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990)
"Basic local alignment search tool." J. Mol. Biol. 215:403-410. (PubMed)
[13] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical
significance of molecular sequence features by using general scoring
schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. (PubMed)
[14] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth.
Enzymol. 266:460-480. (PubMed)
[15] Mushegian, A.R., Bassett, D.E. Jr., Boguski, M.S., Bork, P. & Koonin, E.V.
(1997) "Positionally cloned human disease genes: patterns of evolutionary
conservation and functional motifs." Proc. Natl. Acad. Sci. USA
94:5831-5836. (PubMed)
[16] Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev, S., Yuan, Y. &
Bork, P. (1998) "Homology-based fold predictions for Mycoplasma genitalium
proteins." J. Mol. Biol. 280:323-326. (PubMed)
[17] Aravind, L., Tatusov, R.L., Wolf, Y.I., Walker, D.R. & Koonin, E.V. (1998)
"Evidence for massive gene exchange between archaeal and bacterial
hyperthermophiles." Trends Genet. 14:442-444. (PubMed)
[18] Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton,
G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A., Gocayne, J.D.,
Kerlavage, A.R., Dougherty, B.A., Tomb, J.F., Adams, M.D., Reich, C.I.,
Overbeek, R., Kirkness, E.F., Weinstock, K.G., Merrick, J.M., Glodek, A.,
Scott, J.L., Geoghagen, N.S.M. & Venter, J.C. (1996) "Complete genome
sequence of the methanogenic archaeon, Methanococcus jannaschii." Science
273:1058-1073. (PubMed)
[19] Sterky, F., Holmberg, A., Pettersson, B. & Uhlen, M. (1996) "The sequence
of a 30 kb fragment on the left arm of chromosome XV from Saccharomyces
cerevisiae reveals 15 open reading frames, five of which correspond to
previously identified genes." Yeast 12:1091-1095. (PubMed)
[20] Shuman, S. & Schwer, B. (1995) "RNA capping enzyme and DNA ligase: a
superfamily of covalent nucleotidyl transferases." Mol. Microbiol.
17:405-410. (PubMed)
[21] Wootton, J.C. & Federhen, S. (1996) "Analysis of compositionally biased
regions in sequence databases." Methods Enzymol. 266:554-571. (PubMed)
[22] Lupas, A. (1996) "Prediction and analysis of coiled-coil structures."
Methods Enzymol. 266:513-525. (PubMed)
[23] Miki, Y., Swensen, J., Shattuck-Eidens, D., Futreal, P.A., Harshman, K.,
Tavtigian, S., Liu, Q., Cochran, C., Bennett, L.M., Ding, W., Bell, R.,
Rosenthal, J., Hussey, C., Tran, T., McClure, M., Frye, C., Hattier, T.,
Phelps, R., Haugen-Strano, A., Katcher, H., Yakumo, K., Gholami, Z.,
Shaffer, D., Stone, S., Bayer, S., Wray, C., Bogden, R., Dayananth, P.,
Ward, J., Tonin, P., Narod, S., Bristow, P.K., Norris, F.H., Helvering, L.,
Morrison, P., Rosteck, P., Lai, M., Barrett, J.C., Lewis, C., Neuhausen,
S., Cannon-Albright, L., Goldgar, D., Wiseman, R., Kamb, A. & Skolnick,
M.H. (1994) "A strong candidate for the breast and ovarian cancer
susceptibility gene BRCA1." Science 266:66-71. (PubMed)
[24] Koonin, E.V., Altschul, S.F. & Bork, P. (1996) "BRCA1 protein products:
Functional motifs." Nature Genet. 13:266-268. (PubMed)
[25] Bork, P., Hofmann, K., Bucher, P., Neuwald, A.F., Altschul, S.F. & Koonin,
E.V. (1997) "A superfamily of conserved domains in DNA damage-responsive
cell cycle checkpoint proteins." FASEB J. 11:68-76. (PubMed)
[26] Callebaut, I. & Mornon, J.P. (1997) "From BRCA1 to RAP1: a widespread BRCT
module closely associated with DNA repair." FEBS Lett. 400:25-30. (PubMed)