<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>GEO DataSet Cluster Analysis - GEO - NCBI</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta name="author" content="geo" />
    <meta name="keywords" content="NCBI, national institutes of health, nih, database, archive, central, bioinformatics,  biomedicine, geo, gene, expression, omnibus, chips, microarrays, oligonucleotide, array, sage, CGH" />
    <meta name="description" content="Gene Expression Omnibus (GEO) is a database repository of high throughput  gene expression data and hybridization arrays, chips, microarrays." />
    <meta name="ncbiaccordion" content="collapsible: true, active: false" />
    <meta name="ncbi_app" content="geo" />
    <meta name="ncbi_pdid" content="documentation" />
    <meta name="ncbi_page" content="GEO DataSet Cluster Analysis" />
    <link rel="shortcut icon" href="/geo/img/OmixIconBare.ico" />
    <link rel="stylesheet" type="text/css" href="/geo/css/reset.css" />
    <link rel="stylesheet" type="text/css" href="/geo/css/nav.css" />
    <link rel="stylesheet" type="text/css" href="/geo/css/info.css" />
    <script type="text/javascript" src="/core/jig/1.15.10/js/jig.min.js"></script>
    <script type="text/javascript" src="/geo/js/dd_menu.js"></script>
    <script type="text/javascript" src="/geo/js/info.js"></script>
    <script type="text/javascript">
                    jQuery.getScript("/core/alerts/alerts.js", function () {
                        galert(['#crumbs_login_bar', 'body &gt; *:nth-child(1)'])
                    });
                </script>
    <script type="text/javascript">
                    var ncbi_startTime = new Date();
                </script>
  </head>
  <body id="info" class="cluster">
    <div id="all">
      <div id="page">
        <div id="header">
    <div id="ncbi_logo">
        <a href="/">
            <img src="/geo/img/ncbi_logo.gif" alt="NCBI Logo" />
        </a>
    </div>
    <div id="geo_logo">
        <a href="/geo/"><img src="/geo/img/geo_main.gif" alt="GEO Logo" /></a>
    </div>
</div>
        <div id="nav_bar">
    <ul id="geo_nav_bar">
        <li><a href="#">GEO Publications</a>
            <ul class="sublist">
                <li><a href="/geo/info/GEOHandoutFinal.pdf">Handout</a></li>
                <li><a href="/pmc/articles/PMC10767856/">NAR 2024 (latest)</a></li>
                <li><a href="/pmc/articles/PMC99122/">NAR 2002 (original)</a></li>
                <li><a href="/pmc/?term=10767856,4944384,3531084,3341798,3013736,2686538,2270403,1669752,1619900,1619899,539976,99122">All publications</a></li>
            </ul>
        </li>
        <li><a href="/geo/info/faq.html">FAQ</a></li>
        <li><a href="/geo/info/MIAME.html" title="Minimum Information About a Microarray Experiment">MIAME</a></li>
        <li><a href="mailto:geo@ncbi.nlm.nih.gov">Email GEO</a></li>
    </ul>
</div>
        <div id="crumbs_login_bar"><a title="NCBI home page" href="/">NCBI</a> »
                            <a id="curr_page" title="GEO home page" href="/geo/">GEO</a> »
                            <a title="GEO documentation guide" href="/geo/info/">Info</a> »
                            <span>GEO DataSet Cluster Analysis</span><span id="login_status"><a href="/geo/submitter/" title="Click here to login. You need to do this only if you want to edit the contact information, submit data, see your unreleased data, or work with data already submitted by you. You do not need to login if you are here just to browse through public holdings">Login</a></span></div>
        <div id="content">
		<a id="top"></a>
		<h1>GEO DataSet Cluster Analysis</h1>

		<ul class="page_menu">
			<li><a href="#overview">Overview</a></li>
			<li><a href="#selection">Cluster region selection and visualization options</a></li>
			<li><a href="#input">Data input, filtering and transformation</a></li>
			<li><a href="#clustering">Clustering</a></li>
			<li><a href="#refs">References and acknowledgments</a></li>
		</ul>

		<a id="overview"></a>
		<h2>Overview</h2>

		<a href="/geo/gds/analyze/analyze.cgi?ID=GDS10">
			<img class="cluster_ex" src="/geo/img/clusterbig.gif" alt="GDS10 hierarchical cluster" title="GDS10 hierarchical cluster" width="400" height="500" />
		</a>

		<p>
The GEO DataSet cluster analysis program is a visualization tool for displaying cluster heat maps.
Cluster analysis is one of the most powerful methods for mining and visualizing high-dimensional data.
It attempts to detect natural groups in data using a combination of distance metrics and linkage methods.
Columns (samples) and, independently, rows (genes)
are rearranged to place rows with similar response patterns near each other and columns with
similar response patterns near each other. Cluster results are represented graphically as 'heat
maps', in which expression levels from high to low are presented as a two-color spectrum that
lets the user identify groups of interesting genes through visual pattern recognition.
GEO cluster heat map images are interactive: cluster portions of interest may be selected, enlarged, charted as line plots, and viewed in
<a href="/geoprofiles/">Entrez GEO Profiles</a>, and the original data may be downloaded.
The cluster analysis tool may be accessed from DataSet records under the "analysis"
pull-down menu, or by clicking the cluster thumbnail image.
		</p>
		<p>
Precomputed hierarchical clusters (single linkage, complete linkage, and average linkage/UPGMA)
and user-defined K-means/K-median clustering (where K = 2 through 15) are available.
Clusters are calculated using one of three distance metrics
(Euclidean distance, Pearson correlation, or un-centered correlation coefficient).
For an example, see the
<a href="/geo/gds/analyze/analyze.cgi?ID=GDS10">GDS10 hierarchical cluster</a>
calculated with UPGMA/un-centered correlation.
		</p>
		<p class="highlight">
The cluster analyses provided by GEO offer insight into the relationships between data.
Take care when drawing biological interpretations from cluster results.
GEO clusters are generated automatically from submitter-supplied data using a common set of parameters.
Criteria such as sample size, data distribution, number of repeats,
prior- or post-filtering, and normalization factors are not considered.
For these reasons, data presented in GEO might differ from processed data reported in associated publications.
Alternative algorithms, normalization procedures and distance metrics
will generate different cluster outputs. For K-clustering, the
initialization procedure involves random assignment of genes to each partition,
so K-cluster results may differ from run to run.
		</p>

		<a id="selection"></a>
		<h2>Cluster region selection and visualization options <a href="#top" class="arrow" title="Back to top"></a></h2>

		<p>
			Once a hierarchical or K-cluster image of interest has been identified,
			specific cluster regions may be selected for further analysis as follows:
		</p>

		<div class="cluster_instr">
			<ul>
				<li>
					<span class="list_text">
						The red box is the image cropper.
						To move the image cropper box, drag it across the image, or click on any region of the image.
					</span>
				</li>
				<li>
					<span class="list_text">
						To alter the height of the image cropper, drag the top or bottom borders of the box.
					</span>
				</li>
				<li>
					<span class="list_text">
						The image cropper can also be moved using the arrow keys on the keyboard.
						Holding down the shift key along with the arrow keys moves the box faster;
						holding down the control key moves the box slower.
					</span>
				</li>
				<li>
					<span class="list_text">
						Select additional regions of interest by clicking the "+" icon in the top right corner
						of the active image cropper box, or with "a" on the keyboard. Each selected region is numbered.
					</span>
				</li>
				<li>
					<span class="list_text">
						The "Stack selections" button opens a new window to view multiple, stacked selections.
						The image cropper box can again be used to specify region(s) of interest on the stacked image.
					</span>
				</li>
				<li>
					<span class="list_text">
						Double click the active image cropper box or hit the space bar to view an enlarged image
						of the selected cluster region with gene and sample annotation.
					</span>
				</li>
				<li>
					<span class="list_text">
						Use the "Get selected data" button to download original values and sample information
						in SOFT format for the chosen cluster region(s).
					</span>
				</li>
				<li>
					<span class="list_text">
						Use the "Plot selected gene profiles" button to view profile line plots for the chosen cluster region(s).
					</span>
				</li>
				<li>
					<span class="list_text">
						Use the "Get profiles in Entrez-GEO" button to retrieve individual gene profile charts
						and accompanying information from Entrez GEO Profiles for chosen cluster region(s).
					</span>
				</li>
			</ul>
		</div>

		<a id="input"></a>
		<h2>Data input, filtering and transformation <a href="#top" class="arrow" title="Back to top"></a></h2>

		<p>
DataSet SOFT files are used as input. They contain value measurements as originally supplied by submitters.
Filtering is based on the quality of the data; no assumptions are made about the distribution or range of the data. Missing values, negative count values, and data flagged with Affymetrix "Detection call=Absent" are excluded when calculating distance metrics and generating clusters. When more than 80% of the data points for a gene are invalid, the gene is not used in clustering.
Transformation procedures are kept to a minimum to preserve the original data distribution. Single channel count values are log (base 2) transformed; dual channel log ratio values are left as is. For hierarchical clustering, gene median centering followed by sample median centering is performed once to align data before clustering. No centering is performed for K-means/K-median clustering.
		</p>
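		<p>
			The filtering and transformation steps above can be sketched in Python as follows.
			This is a minimal illustration under stated assumptions, not GEO's actual code:
			the function names <code>prepare</code> and <code>median_center</code> are hypothetical,
			invalid values are represented as <code>None</code>, and a count of zero is treated as invalid
			because its base 2 logarithm is undefined (the text above mentions only negative counts).
		</p>

```python
import math
from statistics import median


def prepare(rows, single_channel=True, max_invalid_fraction=0.8):
    """Filter and transform gene rows (each row is a list of sample values).

    None marks an invalid value (missing, or flagged absent upstream).
    Assumes every row has at least one column.
    """
    kept = []
    for row in rows:
        cleaned = []
        for v in row:
            if v is None or (single_channel and v <= 0):
                # Assumption: zero counts are invalid too, since log2(0) is undefined.
                cleaned.append(None)
            elif single_channel:
                cleaned.append(math.log2(v))  # single channel counts: log base 2
            else:
                cleaned.append(v)             # dual channel log ratios: left as is
        invalid = sum(1 for v in cleaned if v is None)
        if invalid / len(cleaned) > max_invalid_fraction:
            continue  # more than 80% invalid: gene is not used in clustering
        kept.append(cleaned)
    return kept


def median_center(rows):
    """Gene median centering followed by sample median centering, performed once."""
    centered = []
    for row in rows:
        valid = [v for v in row if v is not None]
        m = median(valid) if valid else 0.0
        centered.append([None if v is None else v - m for v in row])
    n_cols = len(centered[0]) if centered else 0
    for j in range(n_cols):
        col = [r[j] for r in centered if r[j] is not None]
        m = median(col) if col else 0.0
        for r in centered:
            if r[j] is not None:
                r[j] -= m
    return centered
```

		<p>
			In this sketch, <code>median_center(prepare(rows))</code> would correspond to the input for hierarchical clustering,
			while <code>prepare(rows)</code> alone would correspond to the input for K-means/K-median clustering, which is not centered.
		</p>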

		<a id="clustering"></a>
		<h2>Clustering <a href="#top" class="arrow" title="Back to top"></a></h2>

		<h3>Distance metrics</h3>

		<p>
			The choice of distance metric determines the classification performance
			(how well a set of genes or samples is separated into clusters).
		</p>

		<p>
			Euclidean distance between genes <img alt="x variable" src="/geo/img/x.png" /> and <img class="variable" alt="y variable" src="/geo/img/y.png" />
			across <img alt="n variable" src="/geo/img/n.png" /> samples,
			or between samples <img alt="x variable" src="/geo/img/x.png" /> and <img class="variable" alt="y variable" src="/geo/img/y.png" />
			across <img alt="n variable" src="/geo/img/n.png" /> genes:
		</p>

		<img class="math" src="/geo/img/d_euclidean.png" alt="Formula for Euclidean distance" />

		<p>
			Pearson correlation between genes <img alt="x variable" src="/geo/img/x.png" /> and <img class="variable" alt="y variable" src="/geo/img/y.png" />
			across <img alt="n variable" src="/geo/img/n.png" /> samples, or between samples
			<img alt="x variable" src="/geo/img/x.png" /> and <img class="variable" alt="y variable" src="/geo/img/y.png" /> across
			<img alt="n variable" src="/geo/img/n.png" /> genes, where
			<img alt="x-bar variable" src="/geo/img/x_bar.png" /> is the mean of <img alt="x variable" src="/geo/img/x.png" /> and
			<img class="variable" alt="y-bar variable" src="/geo/img/y_bar.png" /> is the mean of <img class="variable" alt="y variable" src="/geo/img/y.png" />:
		</p>

		<img class="math" src="/geo/img/r_pearson.png" alt="Formula for Pearson correlation" />

		<p>
			Un-centered correlation coefficient between genes <img alt="x variable" src="/geo/img/x.png" /> and <img class="variable" alt="y variable" src="/geo/img/y.png" />
			across <img alt="n variable" src="/geo/img/n.png" />
			samples, or between samples <img alt="x variable" src="/geo/img/x.png" /> and <img class="variable" alt="y variable" src="/geo/img/y.png" />
			across <img alt="n variable" src="/geo/img/n.png" /> genes:
		</p>

		<img class="math" src="/geo/img/r_uncentered.png" alt="Formula for un-centered correlation" />
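		<p>
			For readers who prefer code to formulas, the three metrics can be sketched in Python using their standard textbook definitions.
			This is an illustration only: the function names are hypothetical, the inputs are assumed to be equal-length lists with no invalid values,
			and the exact normalization used by GEO may differ from what is shown here.
		</p>

```python
import math


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation: both vectors are centered on their means first."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = math.sqrt(sum((a - x_bar) ** 2 for a in x) *
                    sum((b - y_bar) ** 2 for b in y))
    return num / den  # undefined (division by zero) if either vector is constant


def uncentered(x, y):
    """Un-centered correlation: same form as Pearson, with the means taken as zero."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den  # undefined (division by zero) if either vector is all zeros
```

		<p>
			The two correlation measures differ when profiles have the same shape but different offsets:
			for <code>x = [1, 2, 3]</code> and <code>y = [11, 12, 13]</code>, the Pearson correlation is 1,
			whereas the un-centered correlation is below 1 because the offset is not removed.
		</p>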

		<h3>Hierarchical clustering</h3>

		<p>
			Unsupervised hierarchical clustering is performed on all DataSets. Euclidean distance, Pearson correlation, and un-centered correlation coefficient are available as distance metric options. Samples are clustered if there is no ordering of the samples in the DataSet. For ascending-order DataSets, samples are not clustered and their order is not changed. Every gene and sample used in the clustering carries the same weight.
		</p>

		<p>
			Different linkage methods affect the shape of the resulting clusters:
		</p>

		<dl>
			<dt>Single linkage:</dt>
			<dd>The linking distance is the minimum distance between two clusters.</dd>

			<dt>Complete linkage:</dt>
			<dd>The linking distance is the maximum distance between two clusters.</dd>

			<dt>Average linkage/UPGMA:</dt>
			<dd>
				The linking distance is the average of all pair-wise distances between members of the two clusters.
				Since all genes and samples carry equal weight,
				the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
			</dd>
		</dl>
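		<p>
			The three linkage definitions above can be expressed compactly in code.
			The following Python sketch is illustrative only: <code>linkage_distance</code> is a hypothetical helper,
			clusters are assumed to be non-empty lists of numeric vectors, and Euclidean distance (<code>math.dist</code>) is used
			as the default metric purely for the example.
		</p>

```python
import math


def linkage_distance(cluster_a, cluster_b, method="average", dist=math.dist):
    """Linking distance between two clusters of vectors.

    'single'   : minimum pair-wise distance between the clusters
    'complete' : maximum pair-wise distance between the clusters
    'average'  : mean of all pair-wise distances (UPGMA, every member weighted equally)
    """
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)
```

		<p>
			In an agglomerative procedure, the pair of clusters with the smallest linking distance is merged at each step,
			so the choice of method changes which clusters join first and therefore the shape of the tree.
		</p>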

		<h3>K-means/K-median clustering</h3>

		<p>
			The K-clustering procedure divides all genes into K clusters such that the total distance of all genes to their cluster centers is minimized. Users can perform K-means or K-median clustering on any DataSet, with K set to any value from 2 to 15. Euclidean distance, Pearson correlation, and un-centered correlation coefficient are available as distance metric options. K clusters are non-hierarchical and do not overlap. Sample clustering is not implemented in K-clustering.
		</p>
		<p>
The expectation-maximization algorithm first randomly assigns genes to K different groups. It then iterates: the center of each group is calculated as the mean or median, over the genes in the group, of the values for each sample, and each gene is reassigned to the cluster with the closest center. When no more reassignments occur, a solution has been found. The program is run 3 times, and the solution with the lowest total distance of all genes from their cluster centers is reported. A better solution is less likely to exist when the reported solution was found in more than one run than when it was found only once.
		</p>
		<p>
			Since the initialization procedure involves random assignment of genes to each partition,
			cluster results may differ from run to run. Cluster results are saved on our systems for 4 hours.
		</p>
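		<p>
			The K-clustering procedure described in this section can be sketched in Python as follows.
			This is a simplified illustration under several assumptions, not the program GEO runs:
			the function name <code>k_cluster</code> and its parameters are hypothetical, only Euclidean distance is used,
			an iteration cap is added as a safeguard, and an empty cluster is re-seeded from a randomly chosen gene
			(the text above does not say how empty clusters are handled).
		</p>

```python
import math
import random
from statistics import mean, median


def k_cluster(genes, k, center=mean, runs=3, seed=None, max_iter=100):
    """Return (total_distance, labels) for the best of several random restarts.

    genes  : list of equal-length numeric vectors (one value per sample)
    center : statistics.mean for K-means, statistics.median for K-median
    """
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        # Random initial assignment of genes to the K groups.
        labels = [rng.randrange(k) for _ in genes]
        for _ in range(max_iter):
            centers = []
            for c in range(k):
                members = [g for g, lab in zip(genes, labels) if lab == c]
                if not members:
                    # Assumption: re-seed an empty cluster from a random gene.
                    members = [rng.choice(genes)]
                # Center = mean/median of the members, sample by sample.
                centers.append([center(col) for col in zip(*members)])
            # Reassign each gene to the cluster with the closest center.
            new_labels = [
                min(range(k), key=lambda c: math.dist(g, centers[c]))
                for g in genes
            ]
            if new_labels == labels:
                break  # no reassignment occurred: a solution is found
            labels = new_labels
        total = sum(math.dist(g, centers[lab]) for g, lab in zip(genes, labels))
        if best is None or total < best[0]:
            best = (total, labels)
    return best
```

		<p>
			Passing <code>center=median</code> gives the K-median variant. Because the initial assignment is random,
			the cluster numbering, and on harder data the clusters themselves, can change between calls unless a fixed <code>seed</code> is supplied.
		</p>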

		<a id="refs"></a>
		<h2>References and acknowledgments <a href="#top" class="arrow" title="Back to top"></a></h2>

		<p>
			We thank and acknowledge these excellent sources as the basis for this work:
		</p>

		<div class="geo_info_list last">
			<ul>
				<li>
					<span class="list_text">
						<a href="http://rana.lbl.gov/EisenSoftware.htm">Michael Eisen's Cluster and TreeView program source code</a>
						(Copyright (C) 1998-2000 Stanford University) and <a href="http://rana.lbl.gov/manuals/ClusterTreeView.pdf">manual</a>.
					</span>
				</li>
				<li>
					<span class="list_text">
						Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. Cluster analysis and display of
						genome-wide expression patterns.
						<a href="http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&amp;pubmedid=9843981">Proc. Natl. Acad. Sci. USA 95, 14863-8 (1998)</a>
					</span>
				</li>
				<li>
					<span class="list_text">
						De Hoon M. J., Imoto S., Nolan J., and Miyano S.
						<a href="http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/">Open source clustering software.</a>
						Bioinformatics, Feb. (2004)
					</span>
				</li>
				<li>
					<span class="list_text">
						Hartigan, J. A. (1975). Clustering algorithms (New York: Wiley).
					</span>
				</li>
				<li>
					<span class="list_text">
						Sokal, R. R., and Sneath, P. H. A. (1963). Principles of numerical taxonomy (San Francisco: W. H. Freeman).
					</span>
				</li>
			</ul>
		</div>

	</div>
      </div>
      <div id="last_mod">
                        Last modified: July 16, 2024</div>
      <div id="footer">
    <span class="helpbar">|<a href="https://www.nlm.nih.gov"> NLM </a>|<a href="https://www.nih.gov"> NIH </a>|<a href="mailto:geo@ncbi.nlm.nih.gov"> Email GEO </a>|<a href="/geo/info/disclaimer.html"> Disclaimer </a>|<a href="https://www.nlm.nih.gov/accessibility.html"> Accessibility </a>|<a href="https://www.hhs.gov/vulnerability-disclosure-policy/index.html"> HHS Vulnerability Disclosure </a>|
    </span>
</div>
    </div>
    <script type="text/javascript" src="https://www.ncbi.nlm.nih.gov/portal/portal3rc.fcgi/rlib/js/InstrumentOmnitureBaseJS/InstrumentNCBIBaseJS/InstrumentPageStarterJS.js"></script>
  </body>
</html>