Initial Cluster Analysis

J Comput Biol. 2018 Feb;25(2):121-129. doi: 10.1089/cmb.2017.0050. Epub 2017 Aug 3.

Stephen F Altschul¹, Andrew F Neuwald²

Affiliations

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland.
² Department of Biochemistry and Molecular Biology, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland.

PMID: 28771374
PMCID: PMC5806593
DOI: 10.1089/cmb.2017.0050

Abstract

We study a simple abstract problem motivated by a variety of applications in protein sequence analysis. Consider a string of 0s and 1s of length L, and containing D 1s. If we believe that some or all of the 1s may be clustered near the start of the sequence, which subset is the most significantly so clustered, and how significant is this clustering? We approach this question using the minimum description length principle and illustrate its application by analyzing residues that distinguish translational initiation and elongation factor guanosine triphosphatases (GTPases) from other P-loop GTPases. Within a structure of yeast elongation factor 1α, these residues form a significant cluster centered on a region implicated in guanine nucleotide exchange. Various biomedical questions may be cast as the abstract problem considered here.

Keywords: Jeffreys' priors; Minimum Description Length principle; cluster analysis.
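
The abstract problem can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a simplified two-part code in which a cut point X splits the string into an initial and a terminal segment, each segment is encoded with a Jeffreys (Beta(1/2, 1/2)) mixture over its unknown frequency of 1s, and the cut point itself costs log2(L - 1) bits. The function names `jeffreys_bits` and `best_cut` are hypothetical, and the sketch optimizes only the cut point, not the subset of 1s assigned to the cluster.

```python
from math import lgamma, log, log2, pi


def jeffreys_bits(n, k):
    """Bits needed to encode one specific binary string of length n with k 1s,
    using a Beta(1/2, 1/2) (Jeffreys) mixture over the unknown frequency of 1s."""
    # Marginal probability is B(k + 1/2, n - k + 1/2) / B(1/2, 1/2), with B(1/2, 1/2) = pi.
    log_beta = lgamma(k + 0.5) + lgamma(n - k + 0.5) - lgamma(n + 1.0)
    return -(log_beta - log(pi)) / log(2)


def best_cut(bits):
    """Return (bits saved, cut point X) for the best two-segment description.

    A result of (0.0, 0) means no cut shortens the one-segment description.
    Assumption: the cut point is encoded uniformly over the L - 1 interior positions.
    """
    L, D = len(bits), sum(bits)
    one_segment = jeffreys_bits(L, D)
    best_saved, best_x = 0.0, 0
    ones = 0  # number of 1s among the first x positions
    for x in range(1, L):
        ones += bits[x - 1]
        two_segment = (
            log2(L - 1)
            + jeffreys_bits(x, ones)
            + jeffreys_bits(L - x, D - ones)
        )
        saved = one_segment - two_segment
        if saved > best_saved:
            best_saved, best_x = saved, x
    return best_saved, best_x


if __name__ == "__main__":
    clustered = [1] * 10 + [0] * 90  # all ten 1s at the start of the string
    saved, x = best_cut(clustered)
    print(f"cut point X = {x}, bits saved = {saved:.1f}")
```

Under these assumptions, the number of bits saved plays the role of a significance score: a description that is shorter by b bits corresponds roughly to a factor of 2^(-b) in probability, although the paper's own p-value calculation may differ from this rough reading.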

Conflict of interest statement

The authors declare that there are no competing financial interests.

Figures

**FIG. 1.**
The optimization of X and D₁ using Jeffreys' and Flattened priors. **(A)** Histogram for the optimal cut point X from random sequences with and . Bins collect results for X from 1 to 30, 31 to 60, and so on. **(B)** Histogram for the optimal cut point X from random sequences with , 35 1s within the initial 200 positions, and 40 1s within the terminal 401 positions. **(C)** Histogram for the optimal D₁ from the same experiment as for **(B)**.

**FIG. 2.**
Observed p values as a function of calculated p values P. random sequences were generated for each of and , and , , and . Jeffreys' and Flattened prior optimizations are represented by circles and crosses, respectively.

**FIG. 3.**
Initial cluster analysis of residues within the yeast elongation factor eEF1A GTPase domain bound to the nucleotide exchange factor eEF1B (pdb_id: 1g7c). Color scheme: eEF1A GTPase domain, *green*; eEF1A switch I and II regions, *brown*; eEF1A domains II and III, *gray*; eEF1B, *marine blue*; GMP, *cyan*; side chains of TIEF-specific and GTPase-conserved residues included in initial clusters, *red and yellow*, respectively (two side chains common to both clusters are colored *red*). K205 of eEF1B and G70 of eEF1A, which were used as (alternative) focal points, are indicated. GMP, guanosine-5′-monophosphate; TIEF, translation initiation and elongation factor.

**FIG. 4.**
Cut points obtained using Jeffreys' priors versus Flattened priors. The same TIEF-specific analysis was performed as for Figure 3, except that G70 of eEF1A was used as a focal point instead of K205 of eEF1B. The *black diamond* represents the index residue G70; *black dots* and *open circles* represent 1s and 0s (i.e., discriminating and nondiscriminating residues), respectively. The fraction of discriminating residues in each initial cluster is shown parenthetically.

Cited by

SPARC: Structural properties associated with residue constraints.
Neuwald AF, Yang H, Tracy Nixon B. Comput Struct Biotechnol J. 2022 Apr 7;20:1702-1715. doi: 10.1016/j.csbj.2022.04.005. PMID: 35495120.
Inferring joint sequence-structural determinants of protein functional specificity.
Neuwald AF, Aravind L, Altschul SF. Elife. 2018 Jan 16;7:e29880. doi: 10.7554/eLife.29880. PMID: 29336305.
Identifying Function Determining Residues in Neuroimmune Semaphorin 4A.
Chapoval SP, Lee M, Lemmer A, Ajayi O, Qi X, Neuwald AF, Keegan AD. Int J Mol Sci. 2022 Mar 11;23(6):3024. doi: 10.3390/ijms23063024. PMID: 35328445.
Statistical investigations of protein residue direct couplings.
Neuwald AF, Altschul SF. PLoS Comput Biol. 2018 Dec 31;14(12):e1006237. doi: 10.1371/journal.pcbi.1006237. PMID: 30596639.

References

1. Andersen G.R., Valente L., Pedersen L., et al. 2001. Crystal structures of nucleotide exchange intermediates in the eEF1A-eEF1B complex. Nat. Struct. Biol. 8, 531–534
2. Durbin R., Eddy S., Krogh A., et al. 1998. Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, England
3. Fischer J.D., Mayer C.E., and Söding J. 2008. Prediction of protein functional residues from sequence by probability density estimation. Bioinformatics 24, 613–620
4. Grünwald P.D. 2007. The Minimum Description Length Principle. MIT Press, Cambridge, MA
5. Hall A. 2000. GTPases. Oxford University Press, Oxford, England

Grants and funding

R01 GM125878/GM/NIGMS NIH HHS/United States
