Estimation of statistical parameters for local alignment scores
Generalization to PSI-BLAST alignment scores
References
|
|
As discussed previously, computation experiments [1-8] strongly suggest that
the optimal gapped local alignment scores produced by the Smith-Waterman
algorithm [9] and approximated by FASTA [10] or Gapped BLAST [6,7] follow
an extreme value distribution [11]. Specifically, the probability that the
optimal score S from the comparison of unrelated proteins is at least x is
given by the equation
.gif)
where K and lambda are statistical parameters dependent upon the scoring
system and the background amino acid frequences of the sequences being
compared. While FASTA estimates these parameters from the scores generated
by actual database searches, BLAST estimates them beforehand for specific
scoring schemes by comparing many random sequences generated using a standard
protein amino acid composition [12].
For example, using BLOSUM-62 amino acid substitution scores [13], and affine
gap costs [14-16] in which a gap of length k is assigned a score of -(10 + k),
we generated 10,000 pairs of length-1000 random protein sequences, and used
the Smith-Waterman algorithm to calculate 10,000 optimal local alignment
scores. From these scores, lambda was estimated at 0.252 and K at 0.035
by the method of maximum-likelihood [17]. In general, given M samples from
an extreme value distribution, the ratio of the maximum-likelihood estimate
of lambda to its actual value is approximately normally distributed, with mean
1.0 and standard deviation 0.78/sqrt(M) [17]. Thus the standard error for our
estimate of lambda is about 0.002, or less than 1%.
A plot of the optimal local alignment scores generated in this example is
shown in Figure 1, along with the maximum-likelihood extreme value curve.
The chi-squared goodness-of-fit test for these data, with 34 degrees of
freedom, is 25.6, which is lower than would be expected to occur by chance
87% of the time even were the theory precisely valid. To better illustrate
the theory's accuracy, Figure 2 shows P-values derived from formula (1), or
by integrating the curve in Figure 1. The solid curve represents the data,
and the dotted curve the theory. It is difficult to see any space between
them.
In order for PSI-BLAST to iterate automatically, it needs to be able to
generate accurate estimates of the statistical significance of the alignments
it produces. Unfortunately, there is no analytic theory with which to estimate
the statistical significance of a gapped local alignment of a profile and a
simple sequence. We may conjecture, however, that the empirical statistics
for optimal local alignments of two sequences may also apply to alignments
involving a profile and a sequence.
Even were this the case, we are faced with the problem of estimating the
statistical parameters lambda and K for profiles we have seen for the first
time. As before, this may be done by random simulation, but sufficiently
precise estimates require on the order of half an hour on a typical current
workstation. This is at least two orders of magnitude too long for use with
PSI-BLAST.
One hope is that if we construct the amino acid scores within each column
of a PSI-BLAST profile to the same "scale" [18,19] (i.e. with the same
ungapped lambda) as those for a standard amino acid substitution matrix,
and then use the same position-independent gap costs, the same gapped lambda
may result.
To review, for ungapped local alignments, any substitution matrix takes the
form
.gif)
where the qij are the target frequencies for aligned pairs of amino acids,
the pi are background frequencies, and the subscript for lambda indicates
it is the statistical parameter for ungapped local alignments [18,19]. For a
PSI-BLAST profile [7], each column has its own unique set of amino acid target
frequencies qi. Following (2), the amino acid scores for this column may
be constructed to the same "scale" by using the formula
.gif)
The hope is that, given a specific set of gap costs, the gapped lambda for
the PSI-BLAST profile will be the same as the gapped lambda for the standard
substitution matrix, which may be calculated in advance.
To test this hypothesis, we searched SWISS-PROT [20], release 34 (59,576
sequences; 21,219,450 residues) with the length-567 influenza A virus
hemagglutinin precursor [21] (SWISS-PROT accession number P03435) as query,
and captured the profile constructed by PSI-BLAST from the 128 local alignments
with E-value <= 0.01. This profile was constructed to have almost the same
ungapped lambda (i.e. 0.3176) as the BLOSUM-62 matrix. We then compared this
profile to 10,000 random sequences of length 567, generated using standard
background amino acid frequencies [12]. As before, a gap of length k was
assigned a score of -(10 + k). Counts of the optimal local alignment scores,
calculated using an appropriately modified version of the Smith-Waterman
algorithm, are plotted in Figure 3. Also shown is the corresponding maximum-
likelihood extreme value distribution, which has statistical parameters
lambda = 0.254 and K = 0.040. It is apparent that the distribution fits the
random trial reasonably well; a chi-squared goodness-of-fit test with 34
degrees of freedom has value 39.7, which is lower than one would expect 27%
of the time even were the theory precisely valid.
This supports the idea that our statistical theory applies to PSI-BLAST
alignments. Furthermore, the estimate for gapped lambda of 0.254 +- 0.002
agrees to within experimental error with the value 0.252 +- 0.002 previously
estimated for BLOSUM-62, using the same gap costs (see Table 1). Similar
agreement was obtained with a number of other protein sequences as initial
query (results not shown), and in all cases the much less important gapped K
parameter could be estimated accurately as well (Table 1). In general, values
of lambda for local profile alignments appear to differ by less than 2% from
the values for simple local sequence comparison. Using these precomputed
values for gapped lambda should thus entail an error of less one bit for
PSI-BLAST scores under 50 bits, corresponding to a factor of less than two
in the estimation of statistical significance.
The accuracy of PSI-BLAST statistics was further tested by the
comparison of PSI-BLAST profiles with a shuffled version of SWISS-PROT.
The results from eleven queries are shown in Table 2. For each, the query
was compared to the SWISS-PROT database, and a PSI-BLAST profile was
constructed from the resulting significant alignments. This profile was
then compared to a version of SWISS-PROT in which each sequence was shuffled.
The E-values predicted by the estimated PSI-BLAST parameters are pretty
close to the mark.
The accuracy of PSI-BLAST statistics might lead us to speculate that
the lambdas for gapped alignments are a function only of the scale of the
substitution matrix and the gap costs. Unfortunately, this is not the case,
as an examination of Table 3 demonstrates. It appears that the effect of
specific gap costs on lambda is a function at least of the relative entropy
of the substitution matrix, as well as its scale. The reason that PSI-BLAST
profile lambdas could be predicted so well by BLOSUM-62 is that these profiles
are constructed based upon BLOSUM-62 target frequencies, and therefore have
relative entropies roughly equivalent to that of BLOSUM-62. A more precise
theory of gapped alignment statistics, however, may be in the offing [22].
[1] Smith, T.F., Waterman, M.S. & Burks, C. (1985) "The statistical
distribution of nucleic acid similarities." Nucleic Acids Res. 13:645-656. (PubMed)
[2] Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) "The significance of
protein sequence similarities." Comput. Appl. Biosci. 4:67-71. (PubMed)
[3] Mott, R. (1992) "Maximum-likelihood estimation of the statistical
distribution of Smith-Waterman local sequence similarity scores." Bull.
Math. Biol. 54:59-75.
[4] Waterman, M.S. & Vingron, M. (1994) "Rapid and accurate estimates of
statistical significance for sequence database searches." Proc. Natl. Acad. Sci. USA 91:4625-4628. (PubMed)
[5] Waterman, M.S. & Vingron, M. (1994) "Sequence comparison significance and
Poisson approximation." Stat. Sci. 9:367-381.
[6] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth.
Enzymol. 266:460-480. (PubMed)
[7] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. (PubMed)
[8] Pearson, W.R. (1998) "Empirical statistical estimates for sequence
similarity searches." J. Mol. Biol. 276:71-84. (PubMed)
[9] Smith, T.F. & Waterman, M.S. (1981) "Identification of common molecular
subsequences." J. Mol. Biol. 147:195-197. (PubMed)
[10] Pearson, W.R. & Lipman, D.J. (1988) "Improved tools for biological sequence
comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448. (PubMed)
[11] Gumbel, E.J. (1958) "Statistics of extremes." Columbia University Press,
New York, NY.
[12] Robinson, A.B. & Robinson, L.R. (1991) "Distribution of glutamine and
asparagine residues and their near neighbors in peptides and proteins."
Proc. Natl. Acad. Sci. USA 88:8880-8884. (PubMed)
[13] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from
protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919. (PubMed)
[14] Fitch, W.M. & Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl. Acad. Sci. USA 80:1382-1386.
[15] Altschul, S.F. & Erickson, B.W. (1986) "Optimal sequence alignment using
affine gap costs." Bull. Math. Biol. 48:603-616. (PubMed)
[16] Myers, E.W. & Miller, W. (1988) "Optimal alignments in linear space."
Comput. Appl. Biosci. 4:11-17. (PubMed)
[17] Lawless, J.F. (1982) "Statistical models and methods for lifetime data."
Chapter 4. John Wiley & Sons, New York, NY.
[18] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical
significance of molecular sequence features by using general scoring
schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. (PubMed)
[19] Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565. (PubMed)
[20] Bairoch, A. & Apweiler, R. (1998) "The SWISS-PROT protein sequence data
bank and its supplement TrEMBL in 1998." Nucleic Acids Res. 26:38-42. (PubMed)
21] Jou, W.M., Verhoeyen, M., Devos, R., Saman, E., Fang, R., Huylebroeck, D.,
Fiers, W., Threlfall, G., Barber, C., Carey, N. & Emtage, S. (1980)
"Complete structure of the hemagglutinin gene from the human influenza
A/Victoria/3/75 (H3N2) strain as determined from cloned DNA." Cell
19:683-696. (PubMed)
[22] Mott, R. & Tribe, R. "Approximate statistics for gapped alignments."
J. Comp. Biol., in press.
|