J Comput Biol. 2018 Feb;25(2):121-129.
doi: 10.1089/cmb.2017.0050. Epub 2017 Aug 3.

Initial Cluster Analysis

Stephen F Altschul et al. J Comput Biol. 2018 Feb.

Abstract

We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1α, these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.

Keywords: Jeffreys' priors; Minimum Description Length principle; cluster analysis.
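
The abstract problem can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a simple two-part code in which each segment of the string is encoded with a Bernoulli mixture under a Jeffreys' Beta(1/2, 1/2) prior (the Krichevsky-Trofimov code), and the cut point is encoded uniformly. The function names `kt_bits` and `best_initial_cluster` are hypothetical, and the sketch optimizes only the cut point, not which subset of 1s belongs to the cluster.

```python
import math


def kt_bits(ones: int, n: int) -> float:
    """Code length in bits for a binary segment of length n containing
    `ones` 1s, under a Bernoulli mixture with a Jeffreys' Beta(1/2, 1/2)
    prior (an assumption of this sketch)."""
    zeros = n - ones
    log_marginal = (
        math.lgamma(ones + 0.5)
        + math.lgamma(zeros + 0.5)
        - math.lgamma(n + 1.0)
        - 2.0 * math.lgamma(0.5)
    )
    return -log_marginal / math.log(2.0)


def best_initial_cluster(bits):
    """Return (cut, saving): the prefix length whose separate encoding
    saves the most bits relative to encoding the whole string as one
    segment. Returns (0, 0.0) when no denser prefix saves any bits."""
    length = len(bits)
    total = sum(bits)
    null_bits = kt_bits(total, length)
    # Assumed cost of naming one of the length - 1 possible cut points.
    cut_cost = math.log2(length - 1) if length > 2 else 0.0

    best_cut, best_saving = 0, 0.0
    ones_before = 0
    for x in range(1, length):
        ones_before += bits[x - 1]
        ones_after = total - ones_before
        # Only consider prefixes strictly denser in 1s than the remainder
        # (integer cross-multiplication avoids floating-point ties).
        if ones_before * (length - x) <= ones_after * x:
            continue
        two_part = kt_bits(ones_before, x) + kt_bits(ones_after, length - x) + cut_cost
        saving = null_bits - two_part
        if saving > best_saving:
            best_cut, best_saving = x, saving
    return best_cut, best_saving
```

For example, `best_initial_cluster([1] * 20 + [0] * 80)` places the cut after position 20. A larger saving in bits indicates stronger support for an initial cluster under the assumptions of this sketch.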

Conflict of interest statement

The authors declare that there are no competing financial interests.

Figures

FIG. 1.
The optimization of X and D1 using Jeffreys' and Flattened priors. (A) Histogram for the optimal cut point X from [formula] random sequences with [formula] and [formula]. Bins collect results for X from 1 to 30, 31 to 60, and so on. (B) Histogram for the optimal cut point X from [formula] random sequences with [formula], 35 1s within the initial 200 positions, and 40 1s within the terminal 401 positions. (C) Histogram for the optimal D1 from the same experiment as for (B).
FIG. 2.
Observed p values [formula] as a function of calculated p values P. [formula] random sequences were generated for each of [formula] and [formula], and [formula], [formula], and [formula]. Jeffreys' and Flattened prior optimizations are represented by circles and crosses, respectively.
FIG. 3.
Initial cluster analysis of residues within the yeast elongation factor eEF1A GTPase domain bound to the nucleotide exchange factor eEF1Bα (pdb_id: 1g7c). Color scheme: eEF1A GTPase domain, green; eEF1A switch I and II regions, brown; eEF1A domains II and III, gray; eEF1Bα, marine blue; GMP, cyan; side chains of TIEF-specific and GTPase-conserved residues included in initial clusters, red and yellow, respectively (two side chains common to both clusters are colored red). K205 of eEF1Bα and G70 of eEF1A, which were used as (alternative) focal points, are indicated. GMP, guanosine-5′-monophosphate; TIEF, translation initiation and elongation factor.
FIG. 4.
Cut points obtained using Jeffreys' priors versus Flattened priors. The same TIEF-specific analysis was performed as for Figure 3, except that G70 of eEF1A was used as a focal point instead of K205 of eEF1Bα. The black diamond represents the index residue G70; black dots and open circles represent 1s and 0s (i.e., discriminating and nondiscriminating residues), respectively. The fraction of discriminating residues in each initial cluster is shown parenthetically.

References

    1. Andersen G.R., Valente L., Pedersen L., et al. 2001. Crystal structures of nucleotide exchange intermediates in the eEF1A-eEF1Bα complex. Nat. Struct. Biol. 8, 531–534
    2. Durbin R., Eddy S., Krogh A., et al. 1998. Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, England
    3. Fischer J.D., Mayer C.E., and Söding J. 2008. Prediction of protein functional residues from sequence by probability density estimation. Bioinformatics 24, 613–620
    4. Grünwald P.D. 2007. The Minimum Description Length Principle. MIT Press, Cambridge, MA
    5. Hall A. 2000. GTPases. Oxford University Press, Oxford, England
