<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>About GEO2R - GEO - NCBI</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta name="author" content="geo" />
    <meta name="keywords" content="NCBI, national institutes of health, nih, database, archive, central, bioinformatics,  biomedicine, geo, gene, expression, omnibus, chips, microarrays, oligonucleotide, array, sage, CGH" />
    <meta name="description" content="Gene Expression Omnibus (GEO) is a database repository of high throughput  gene expression data and hybridization arrays, chips, microarrays." />
    <meta name="ncbiaccordion" content="collapsible: true, active: false" />
    <meta name="ncbi_app" content="geo" />
    <meta name="ncbi_pdid" content="documentation" />
    <meta name="ncbi_page" content="About GEO2R" />
    <link rel="shortcut icon" href="/geo/img/OmixIconBare.ico" />
    <link rel="stylesheet" type="text/css" href="/geo/css/reset.css" />
    <link rel="stylesheet" type="text/css" href="/geo/css/nav.css" />
    <link rel="stylesheet" type="text/css" href="/geo/css/info.css" />
    <script type="text/javascript" src="/core/jig/1.15.10/js/jig.min.js"></script>
    <script type="text/javascript" src="/geo/js/dd_menu.js"></script>
    <script type="text/javascript" src="/geo/js/info.js"></script>
    <script type="text/javascript">
                    jQuery.getScript("/core/alerts/alerts.js", function () {
                        galert(['#crumbs_login_bar', 'body &gt; *:nth-child(1)'])
                    });
                </script>
    <script type="text/javascript">
                    var ncbi_startTime = new Date();
                </script>
  </head>
  <body id="info" class="geo2r">
    <div id="all">
      <div id="page">
        <div id="header">
    <div id="ncbi_logo">
        <a href="/">
            <img src="/geo/img/ncbi_logo.gif" alt="NCBI Logo" />
        </a>
    </div>
    <div id="geo_logo">
        <a href="/geo/"><img src="/geo/img/geo_main.gif" alt="GEO Logo" /></a>
    </div>
</div>
        <div id="nav_bar">
    <ul id="geo_nav_bar">
        <li><a href="#">GEO Publications</a>
            <ul class="sublist">
                <li><a href="/geo/info/GEOHandoutFinal.pdf">Handout</a></li>
                <li><a href="/pmc/articles/PMC10767856/">NAR 2024 (latest)</a></li>
                <li><a href="/pmc/articles/PMC99122/">NAR 2002 (original)</a></li>
                <li><a href="/pmc/?term=10767856,4944384,3531084,3341798,3013736,2686538,2270403,1669752,1619900,1619899,539976,99122">All publications</a></li>
            </ul>
        </li>
        <li><a href="/geo/info/faq.html">FAQ</a></li>
        <li><a href="/geo/info/MIAME.html" title="Minimum Information About a Microarray Experiment">MIAME</a></li>
        <li><a href="mailto:geo@ncbi.nlm.nih.gov">Email GEO</a></li>
    </ul>
</div>
        <div id="crumbs_login_bar"><a title="NCBI home page" href="/">NCBI</a> »
                            <a id="curr_page" title="GEO home page" href="/geo/">GEO</a> »
                            <a title="GEO documentation guide" href="/geo/info/">Info</a> »
                            <span>About GEO2R</span><span id="login_status"><a href="/geo/submitter/" title="Click here to login. You need to do this only if you want to edit the contact information, submit data, see your unreleased data, or work with data already submitted by you. You do not need to login if you are here just to browse through public holdings">Login</a></span></div>
        <div id="content">
        <a id="top"></a>
        <h1>About GEO2R</h1>

        <ul class="page_menu">
            <li><a href="#background">Background</a>
                <ul>
                    <li><a href="#rnaseq">RNA-seq data</a></li>
                    <li><a href="#microarray">Microarray data</a></li>
                </ul>
            </li>
            <li><a href="#how_to_use">How to use</a>
                <ul>
                    <li><a href="#accession">Enter a Series accession number</a></li>
                    <li><a href="#groups">Define Sample groups</a></li>
                    <li><a href="#assign">Assign Samples to each group</a></li>
                    <li><a href="#test">Perform the analysis</a></li>
                    <li><a href="#interpret">Top differentially expressed genes</a></li>
                    <li><a href="#visualization">Visualization</a></li>
                    <li><a href="#video">Tutorial video</a></li>
                </ul>
            </li>
            <li><a href="#options_features">Edit options and features</a>
                <ul>
                    <li><a href="#options">Options</a></li>
                    <li><a href="#profile_graph">Profile graph</a></li>
                    <li><a href="#r_script">R script</a></li>
                </ul>
            </li>
            <li><a href="#limitations">Limitations and caveats</a></li>
            <li><a href="#references">More information and references</a>
                <ul>
                    <li><a href="#summary_statistics">Summary statistics</a></li>
                    <li><a href="#general_references">General references</a></li>
                    <li><a href="#adjustment_references">Adjustment test references</a></li>
                </ul>
            </li>
        </ul>

        <a id="background"></a>
        <h2>Background</h2>

        <p>
            <a href="/geo/geo2r/">GEO2R</a> is an interactive web tool that allows users to compare two or more groups of Samples in a GEO Series
            in order to identify genes that are differentially expressed across experimental conditions.
            Results are presented as a table of genes ordered by P-value, and as a collection of
            graphic plots to help visualize differentially expressed genes and assess data set quality.
            GEO2R uses a variety of R packages from the <a href="https://www.bioconductor.org">Bioconductor</a> project.
            Bioconductor is an open-source software project based on the R programming language
            that provides tools for the analysis of high-throughput genomic data.
        </p>

        <a id="rnaseq"></a>
        <h3>RNA-seq data <span class="beta">BETA</span></h3>
        <p>
            GEO2R uses <em><a href="https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html">DESeq2</a></em>
            to perform differential expression analysis using
            <a href="/geo/info/rnaseqcounts.html#raw">NCBI-computed raw count matrices</a> as input.
            <em>DESeq2</em> is an R package for identifying differentially expressed genes in RNA-seq data.
            It uses negative binomial generalized linear models and has features that offer consistent
            performance over a large range of data types, making it applicable for small studies
            with few replicates as well as for large observational studies.
        </p>

        <a id="microarray"></a>
        <h3>Microarray data</h3>
        <p>
            GEO2R uses <em><a href="https://www.bioconductor.org/packages/2.8/bioc/html/GEOquery.html">GEOquery</a></em>
            and <em><a href="https://www.bioconductor.org/packages/release/bioc/html/limma.html">limma</a></em>
            to perform differential expression analysis using original submitter-supplied processed
            data tables as input. <em>GEOquery</em> parses GEO data into R data structures that can be used
            by other R packages. <em>limma</em> (Linear Models for Microarray Data) is an R package
            for identifying differentially expressed genes in microarray data.
            It handles a wide range of experimental designs and data types and applies
            multiple-testing corrections on P-values to help correct for the occurrence of false positives.
        </p>
        <p>
            <strong>IMPORTANT</strong>: GEO2R does not rely on curated DataSets and examines the Series Matrix data
            files directly. It is important to realize that this tool can access and analyze almost
            any GEO Series, regardless of data type and quality, so the user must be aware of
            GEO2R <a href="#limitations">Limitations and caveats</a>.
        </p>


        <a id="how_to_use"></a>
        <h2>How to use <a title="Back to top" class="arrow" href="#top">Back to top</a></h2>

        <a id="accession"></a>
        <h3>Enter a Series accession number</h3>

        <p>
            If you followed a link from a Series record, the GEO accession box will already be populated.
            Otherwise, enter a Series accession number in the box, e.g., GSE25724.
            If the Series is associated with multiple microarray Platforms, you will be asked to select the Platform of interest.
        </p>

        <a id="groups"></a>
        <h3>Define Sample groups</h3>

        <p>
            In the Samples panel, click 'Define groups' and enter names for the groups of Samples you plan to compare,
            e.g., <em>test</em> and <em>control</em>. Up to 10 groups can be defined.
            At least two groups must be defined in order to perform the analysis.
            Groups can be removed using the [X] feature next to the group name.
            The order in which you define the groups has a bearing on downstream results.
            For two-group comparisons, it is typically appropriate to define the test group first
            and then the control group. That way, the log fold change direction will follow
            convention and be positive for genes upregulated in test Samples compared to controls,
            and negative for downregulated genes.
            (Note: This change was implemented in November 2020.
            You can reverse the order in which groups are created if you need to replicate a previous analysis.)
        </p>
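        <p>
            The effect of group order on the sign of the log fold change can be illustrated with a minimal Python sketch.
            It is not the calculation GEO2R performs (that is done by <em>limma</em> or <em>DESeq2</em> model fitting);
            the function name and the unlogged group means are hypothetical and serve only to show the sign convention.
        </p>

```python
import math

def log2_fold_change(first_group_mean, second_group_mean):
    """log2(first / second): positive when expression is higher in the first-defined group."""
    return math.log2(first_group_mean / second_group_mean)

# Hypothetical unlogged mean expression values for one gene.
test_mean, control_mean = 8.0, 2.0

# Test group defined first: an upregulated gene gets a positive value.
test_first = log2_fold_change(test_mean, control_mean)

# Control group defined first: the same gene gets a negative value.
control_first = log2_fold_change(control_mean, test_mean)

print(test_first, control_first)
```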

        <a id="assign"></a>
        <h3>Assign Samples to each group</h3>

        <img src="/geo/img/geo2r_sample_groups.jpg" class="geo2r_img" alt="Screenshot of GEO2R samples table" title="The Samples table which lists the Samples in the study and their descriptions. Two Sample groups are defined, 'space flown' and 'control'. Four Samples are assigned to each group." />

        <p>
            To assign Samples to a group, highlight relevant Sample rows.
            Multiple rows may be highlighted either by dragging the cursor over contiguous Samples or using Ctrl or Shift keys.
            When relevant Samples are highlighted, click the group name to assign those Samples to the group. Repeat for each group.
            Not all Samples in a Series need to be selected for the analysis to work.
        </p>
        <p>
            Use the Sample metadata columns to help determine which Samples belong to which group.
            The table is populated with Accession, Title, Source name and individual Characteristics fields from the Sample records.
            You can change which fields are displayed using the <em>Columns</em> box at the upper right corner of the table,
            and the columns can be sorted by clicking the table headers.
        </p>

        <a id="test"></a>
        <h3>Perform the analysis</h3>

        <p>
            After Samples have been assigned to groups, click the <em>Analyze</em> button to run the analysis with default parameters.
        </p>
        <p>
            Alternatively, you can edit the default analysis parameters in the <em>Options</em> tab.
            For example, you can select an alternative P-value adjustment method in the <em>Options</em> tab and click <em>Reanalyze</em>
            to run the analysis with revised parameters.
            Details regarding each edit option are provided in the <a href="#options_features">Edit options and features</a> section below.
        </p>
        <p>
            You can click the Analyze button without defining groups and retrieve
            plots that can be helpful in assessing normalization status and Sample groupings, that is,
            they can help you determine suitability of the study for further analysis and whether to apply
            any adjustments to the test.
        </p>

        <a id="interpret"></a>
        <h3>Top differentially expressed genes</h3>

        <img src="/geo/img/geo2r_results_table.jpg" class="geo2r_img" alt="Screenshot of GEO2R results table" title="The results table which lists the top 250 differentially expressed genes. The first row is clicked to reveal the gene expression profile graph for gene Rbm3." />

        <p>
            Results are presented in the browser as a table of the top 250 genes ranked by adjusted P-value
            (P-values corrected for multiple testing). For RNA-seq, the table is the result of the
            Wald test when comparing 2 groups of Samples, and of the Likelihood Ratio Test (LRT)
            when comparing 3 or more groups of Samples.
            Click on a row to reveal the gene expression profile graph for that gene.
            Each red bar in the graph represents the expression measurement extracted
            from the <a href="/geo/info/rnaseqcounts.html#norm">TPM normalized</a> expression counts (for RNA-seq), or the Value column
            of the original submitter-supplied Sample record (for microarrays).
            The Sample accession numbers and group names are listed along the bottom of the chart.
        </p>
        <p>
            Use the <em>Select columns</em> feature to modify which data and annotation columns are included in the table.
            Information about the meaning of the data columns is provided in the <a href="#summary_statistics">Summary statistics</a> section.
        </p>
        <p>
            If you want to edit the analysis parameters, you can do so in the <em>Options</em> tab,
            then click <em>Reanalyze</em> to apply the edits.
        </p>
        <p>
            To see more than the top 250 genes, use the <em>Download full table</em> link to download
            the entire set of results.
            The downloaded file is tab-delimited and suitable for opening in a spreadsheet application such as Excel.
        </p>
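        <p>
            Because the downloaded file is tab-delimited, it can also be read programmatically.
            The following Python sketch is one way to do this. The column names
            (<em>ID</em>, <em>adj.P.Val</em>, <em>logFC</em>) and the sample rows are assumptions made for illustration;
            check the header of your own download before relying on them.
        </p>

```python
import csv
import io

# Hypothetical excerpt of a downloaded results file. The column names
# are assumptions and may differ in a real download.
SAMPLE_TSV = (
    "ID\tadj.P.Val\tlogFC\n"
    "geneA\t0.001\t2.5\n"
    "geneB\t0.20\t-0.3\n"
    "geneC\t0.04\t-1.8\n"
)

def significant_rows(handle, alpha=0.05):
    """Return the rows whose adjusted P-value is below alpha."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if alpha > float(row["adj.P.Val"])]

hits = significant_rows(io.StringIO(SAMPLE_TSV))
hit_ids = [row["ID"] for row in hits]
print(hit_ids)
```

        <p>
            For a real file, replace the <em>io.StringIO</em> object with an opened file handle.
        </p>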

        <a id="visualization"></a>
        <h3>Visualization</h3>
        <p>
            Several graphical plots are generated to help users further explore differentially expressed genes
            and assess dataset quality. More details on the generation and usage of these plots can be found in the
            <a href="https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html">Analyzing RNA-seq data with DESeq2</a>
            vignette and the
            <a href="https://bioconductor.org/packages/2.6/bioc/vignettes/limma/inst/doc/usersguide.pdf">limma Users Guide</a>,
            as well as the GEO2R <a href="#r_script">R script</a> tab.
        </p>

        <table class="overview">
            <tbody>
                <tr>
                    <th><a href="https://en.wikipedia.org/wiki/Volcano_plot_(statistics)">Volcano plot</a></th>
                    <td><img alt="Volcano plot" src="/geo/img/geo2r_volcano.png" /></td>
                    <td>
                        A volcano plot displays statistical significance (-log10 P-value) versus magnitude
                        of change (log2 fold change) and is useful for visualizing differentially expressed
                        genes. Click the <em>Explore and download</em> link to go to the interactive plot. There,
                        you can mouse over data points to see individual gene annotation. Highlighted genes
                        are significantly differentially expressed at a default adjusted P-value cutoff of 0.05
                        (red = upregulated, blue = downregulated). You can change the significance cutoff in the
                        <em>Options</em> tab. A volcano plot displays the test results for a single contrast
                        (a contrast is one Sample group compared to another Sample group). Thus, if you defined
                        more than 2 Sample groups in your analysis, a separate plot is generated for each contrast.
                        By default, for &gt;2 groups of Samples, the number of contrasts presented is equal to the number
                        of groups, and each group is compared to the next in the order that they were created.
                        Alternatively, you can select up to 5 custom contrasts in the <em>Options</em> tab. If more than
                        2 Sample groups are defined, use the checkboxes to toggle between contrasts. Use the
                        <em>Download significant genes</em> button to download the highlighted genes in each contrast.
                    </td>
                </tr>
                <tr>
                    <th><a href="https://en.wikipedia.org/wiki/MA_plot">Mean difference (MD) plot</a></th>
                    <td><img alt="Mean difference (MD) plot" src="/geo/img/geo2r_plotMD.png" /></td>
                    <td>
                        A mean difference (MD) plot displays log2 fold change versus average log2 expression values
                        and is useful for visualizing differentially expressed genes. Click the <em>Explore and download</em>
                        link to go to the interactive plot. There, as in the volcano plot, you can mouse over data
                        points to see individual gene annotation. Highlighted genes are significantly differentially
                        expressed at a default adjusted P-value cutoff of 0.05 (red = upregulated,
                        blue = downregulated). You can change the significance cutoff in the <em>Options</em> tab.
                        A mean difference plot displays the test results for a single contrast
                        (a contrast is one Sample group compared to another Sample group).
                        Thus, if you defined more than 2 Sample groups in your analysis, a separate plot is generated
                        for each contrast. By default, for &gt;2 groups of Samples, the number of contrasts presented
                        is equal to the number of groups, and each group is compared to the next in the order that
                        they were created. Alternatively, you can select up to 5 custom contrasts in the <em>Options</em> tab.
                        If more than 2 Sample groups are defined, use the checkboxes to toggle between contrasts.
                        Use the <em>Download significant genes</em> button to download the highlighted genes in each contrast.
                    </td>
                </tr>
                <tr>
                    <th><a href="https://en.wikipedia.org/wiki/Dimensionality_reduction#UMAP">UMAP</a></th>
                    <td><img alt="UMAP" src="/geo/img/geo2r_umap.png" /></td>
                    <td>
                        Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique useful
                        for visualizing how Samples are related to each other. The number of nearest neighbors
                        used in the calculation is indicated in the plot. This plot can be generated without
                        Sample group selection; just click <em>Analyze</em> before defining groups.
                    </td>
                </tr>
                <tr>
                    <th><a href="https://en.wikipedia.org/wiki/Venn_diagram">Venn diagram</a></th>
                    <td><img alt="Venn diagram" src="/geo/img/geo2r_vennDiagram.png" /></td>
                    <td>
                        Use to explore and download the overlap in significant genes between multiple contrasts.
                        The genes in each region on the Venn diagram can be downloaded by selecting the relevant
                        contrasts. For example, in the Venn diagram shown here, select both
                        'healthy control vs osteoarthritis' and 'healthy control vs rheumatoid arthritis'
                        to download the 976 significant genes that are common to both contrasts,
                        but not to 'osteoarthritis vs rheumatoid arthritis'. To download all significant
                        genes for a given contrast, use the interactive volcano or MD plot pages instead.
                        <br />
                        Limitation: Data for up to 5 contrasts can be plotted. When &gt;5 groups have been defined,
                        default behavior is to show contrasts with the highest and lowest number of expressed genes.
                        Alternatively, you can select which 5 contrasts to display on the <em>Options</em> tab.

                    </td>
                </tr>
                <tr>
                    <th><a href="https://en.wikipedia.org/wiki/Box_plot">Boxplot</a></th>
                    <td><img alt="Boxplot" src="/geo/img/geo2r_boxplot.png" /></td>
                    <td>
                        Use to view the distribution of the values of the selected Samples. The Samples are colored
                        according to groups. Viewing the distribution can be useful for determining if your selected
                        Samples are suitable for differential expression analysis. Generally, median-centered values
                        indicate that the data are normalized and cross-comparable. If that is not the case,
                        you might consider checking <em>Force normalization</em> in the <em>Options</em> tab, which applies
                        quantile normalization to the expression data so that all selected Samples have an identical
                        value distribution. The plot shows data after log transform and normalization, if they were
                        performed. This plot can be generated without Sample group selection; just click <em>Analyze</em>
                        before defining groups.
                    </td>
                </tr>
                <tr>
                    <th>Expression density</th>
                    <td><img alt="Expression density" src="/geo/img/geo2r_plotDensities.png" /></td>
                    <td>
                        Use to view the distribution of the values of the selected Samples. The Samples are colored
                        according to groups. This plot complements the boxplot (above) in checking for data normalization
                        before differential expression analysis. If density curves differ greatly from Sample to
                        Sample, you might consider checking <em>Force normalization</em> in the <em>Options</em> tab.
                        The plot shows data after log transform and normalization, if they were performed.
                        This plot can be generated without Sample group selection; just click <em>Analyze</em>
                        before defining groups.
                    </td>
                </tr>
                <tr>
                    <th>Adjusted P-value histogram</th>
                    <td><img alt="Adjusted P-value histogram" src="/geo/img/geo2r_hist.png" /></td>
                    <td>
                        Generated using <a href="https://www.rdocumentation.org/packages/graphics/topics/hist">hist</a>
                        <br />
                        Use to view the distribution of the P-values in the analysis results. The P-value here is
                        the same as in the <em>Top differentially expressed genes</em> table and is computed using all
                        selected contrasts. While the displayed table is limited to 250 genes, this plot allows
                        you to see the 'big picture' by showing the P-value distribution for all analyzed genes.
                    </td>
                </tr>
                <tr>
                    <th><a href="https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot">Moderated t-statistic quantile-quantile (q-q) plot</a></th>
                    <td><img alt="Moderated t-statistic quantile-quantile (q-q) plot" src="/geo/img/geo2r_qqt.png" /></td>
                    <td>
                        Plots the quantiles of a data sample against the theoretical quantiles of a Student's
                        t distribution. This plot helps to assess the quality of the <em>limma</em> test results.
                        Ideally the points should lie along a straight line, meaning that the
                        moderated t-statistics computed during the test follow their theoretically predicted
                        distribution.
                    </td>
                </tr>
                <tr>
                    <th>Mean-variance trend</th>
                    <td><img alt="Mean-variance trend" src="/geo/img/geo2r_plotSA.png" /></td>
                    <td>
                        This plot is used to check the mean-variance relationship of the expression data
                        after fitting a linear model. It can help show whether there is a lot of variation in the data
                        and whether applying the precision weights option, which takes the mean-variance
                        trend into account, is recommended. Precision weights improve accuracy of test results when
                        a strong mean-variance trend is present.
                        Each point represents a gene. The red line is the mean-variance trend approximation that can be
                        (or already is, if the precision weights option in the <em>Options</em> tab is checked) taken into account
                        during differential gene expression analysis. The blue line is the constant variance
                        approximation. This plot can be generated without Sample group selection; just click <em>Analyze</em>
                        before defining groups.
                    </td>
                </tr>
            </tbody>
        </table>
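        <p>
            The default contrast selection described for the volcano and MD plots (each group compared to the next
            in creation order, with as many contrasts as groups) can be sketched in Python as follows.
            The wrap-around from the last group back to the first is an assumption inferred from the statement that
            the number of contrasts equals the number of groups; the function and group names are hypothetical.
        </p>

```python
def default_contrasts(groups):
    """Pair each group with the next one in creation order.

    Assumption: the last group is paired with the first, so that the
    number of contrasts equals the number of groups.
    """
    n = len(groups)
    return [(groups[i], groups[(i + 1) % n]) for i in range(n)]

# Three hypothetical groups, listed in the order they were created.
contrasts = default_contrasts(["control", "treatment A", "treatment B"])
for first, second in contrasts:
    print(first, "vs", second)
```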

        <a id="video"></a>
        <h3>Tutorial video</h3>
        <p>
            <span id="tut_video">
                <iframe width="640" height="360" src="https://www.youtube.com/embed/9RyWjzSnaE0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="true"></iframe>
            </span>
        </p>

        <a id="options_features"></a>
        <h2>Edit options and features <a title="Back to top" class="arrow" href="#top">Back to top</a></h2>

        <a id="options"></a>
        <h3>Options</h3>

        <p>
            <strong>Apply adjustment to the P-values:</strong> <em>limma</em> and <em>DESeq2</em> provide several P-value adjustment options.
            These adjustments, also called <a href="http://en.wikipedia.org/wiki/Multiple_testing_correction">multiple-testing corrections</a>,
            attempt to correct for the occurrence of false positive results.
            The <em>Benjamini &amp; Hochberg false discovery rate</em> method
            is selected by default because it provides a good balance
            between discovery of statistically significant genes and limitation of false positives.
            If you want to change the adjustment method, go to the <em>Options</em> tab and select another method.
            References for each method are provided below.
            The adjusted P-values are listed in the <em>Adj P-value</em> column of the results table.
        </p>
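        <p>
            To make the default adjustment concrete, the following Python sketch implements the standard
            Benjamini &amp; Hochberg step-up procedure on a plain list of P-values.
            It is an illustration of the method, not the code GEO2R runs (GEO2R relies on the R packages named above);
            the function name and the example P-values are hypothetical.
        </p>

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, scaling each by
    # m / rank and keeping the running minimum so that the adjusted
    # values never decrease as the raw P-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P-values for four genes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```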
        <p>
            <strong>Apply log2 fold change threshold:</strong>
            If you are interested only in genes with larger log2 fold changes, you can apply a log2 fold change threshold in the <em>Options</em> tab.
            The default is 0. When you choose a threshold, only genes whose absolute log2 fold change is equal to or
            greater than the chosen value will be highlighted in the volcano plot, mean difference plot and Venn diagram.
            For example, if you choose a log2 fold change threshold of 3, then only genes with log2 fold change
            greater than 3 or less than -3 will be colored red or blue, respectively. When a log2 fold change threshold has been chosen in
            the <em>Options</em> tab, the <em>Download significant genes</em> button will download only those genes
            that have passed the log2 fold change threshold.
        </p>
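        <p>
            The highlighting rule can be sketched in Python as below. This is an illustration under stated assumptions:
            the function name is hypothetical, a gene is treated as significant when its adjusted P-value is below the cutoff,
            and a gene exactly at the threshold is treated as passing, which may not match the tool's boundary handling.
        </p>

```python
def highlight_colour(log_fc, adj_p, lfc_threshold=0.0, alpha=0.05):
    """Return 'red' (up), 'blue' (down) or None (not highlighted) for one gene."""
    significant = alpha > adj_p
    if not significant:
        return None
    # Assumption: genes exactly at the threshold are kept.
    if lfc_threshold > abs(log_fc):
        return None
    if log_fc > 0:
        return "red"
    if 0 > log_fc:
        return "blue"
    return None

# Hypothetical genes evaluated with a log2 fold change threshold of 3.
up_gene = highlight_colour(3.5, 0.01, lfc_threshold=3)
down_gene = highlight_colour(-4.0, 0.01, lfc_threshold=3)
small_change = highlight_colour(2.0, 0.01, lfc_threshold=3)
not_significant = highlight_colour(3.5, 0.2, lfc_threshold=3)
print(up_gene, down_gene, small_change, not_significant)
```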
        <p>
            <strong>Apply log transformation to the data:</strong> (Microarray only)
            The GEO database accepts a variety of data value types, including logged and
            unlogged data. <em>limma</em> expects data values to be in log space. To address this, GEO2R has an auto-detect feature that checks the
            values of selected Samples and automatically performs a log2 transformation on values determined not to be in log space.
            Alternatively, the user can select <em>Yes</em> to force log2 transformation, or <em>No</em> to override the auto-detect feature.
            The auto-detect feature only considers Sample values that have been assigned to a group, and applies the transformation in
            an all-or-none fashion.
        </p>
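        <p>
            The all-or-none behavior can be illustrated with a short Python sketch. The detection rule used here
            (treat the data as unlogged when the 99th percentile of the pooled values exceeds 100) is a simplified,
            hypothetical heuristic chosen for illustration and should not be read as the exact rule GEO2R applies;
            the function names are likewise assumptions.
        </p>

```python
import math

def looks_unlogged(values, ceiling=100.0):
    """Hypothetical heuristic: a high 99th percentile suggests raw-scale data."""
    ordered = sorted(values)
    idx = int(0.99 * (len(ordered) - 1))
    return ordered[idx] > ceiling

def auto_log2(samples):
    """samples: one list of values per selected Sample.

    The decision is made once on the pooled values, so either every
    Sample is transformed or none is.
    """
    pooled = [v for sample in samples for v in sample]
    if not looks_unlogged(pooled):
        return samples
    # Non-positive values have no logarithm and become NaN.
    return [
        [math.log2(v) if v > 0 else float("nan") for v in sample]
        for sample in samples
    ]

raw = [[16.0, 1024.0], [256.0, 4096.0]]       # unlogged intensities
already_logged = [[4.0, 10.0], [8.0, 12.0]]   # log2-scale values

from_raw = auto_log2(raw)
from_logged = auto_log2(already_logged)
print(from_raw, from_logged)
```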
        <p>
            <strong>Apply limma precision weights (vooma):</strong> (Microarray only)
            The <a href="https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/vooma">vooma</a>
            function estimates the mean-variance relationship and uses this to compute appropriate observational-level weights.
        </p>
        <p>
            <strong>Force normalization:</strong> (Microarray only)
            This function applies quantile <a href="https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/normalizeBetweenArrays">normalization</a>
            to the expression data so that all selected Samples have an identical value distribution.
        </p>
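        <p>
            The idea behind quantile normalization can be shown with a minimal pure-Python sketch.
            It is not the <em>limma</em> implementation: ties and missing values are handled naively here,
            and the function name and example values are hypothetical.
        </p>

```python
def quantile_normalize(samples):
    """samples: equal-length lists of expression values, one list per Sample."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across Samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples)
        for k in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered by their value within this Sample.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            # Each gene receives the reference value for its rank.
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized

# Two hypothetical Samples measured on three genes.
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(normalized)
```

        <p>
            After this step every Sample contains the same set of values, so their boxplots and density curves coincide.
        </p>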
        <p>
            <strong>Category of Platform annotation to display on results:</strong> (Microarray only)
            Select which category of annotation to display on results.
            Gene annotations are derived from the corresponding Platform record. Two types of annotation are possible:
        </p>
        <p>
            <em>NCBI generated</em> annotation is available for many records.
            These annotations are derived by extracting stable sequence identification information from the Platform
            and periodically querying against the Entrez Gene database to generate consistent and up-to-date annotation.
            Gene symbol and Gene title annotations are selected by default.
            Other categories of <em>NCBI generated</em> annotation include GO terms and chromosomal location information.
        </p>
        <p>
            <em>Submitter supplied</em> annotation is available for all records.
            These represent the original Platform annotations provided by the submitter.
            Note that there is a lot of diversity in the style and content of <em>submitter supplied</em> annotations
            and they may not have been updated since the time of submission.
        </p>
        <p>
            <strong>Adjusted P-value threshold:</strong> Volcano, MA (mean difference) and Venn plots highlight significantly differentially
            expressed genes. The default adjusted P-value significance cutoff is 0.05. You can raise or lower the
            cutoff by entering a new number between 0 and 1.
        </p>
        <p>
            <strong>Volcano, MA and Venn contrasts:</strong> Volcano and MA plots display data for a single contrast
            (a contrast is one Sample group compared to another Sample group). Thus, if you defined more than 2 Sample
            groups in your analysis, a separate plot is generated for each contrast. A maximum of 5 custom contrasts
            is presented on volcano, MA and Venn plots. For studies with more than 5 possible contrasts, you can change
            the contrast selection using the drop-down menu.
        </p>


        <a id="profile_graph"></a>
        <h3>Profile graph</h3>

        <p>
            This tab allows you to view a specific gene expression profile graph.
            For RNA-seq data, enter the gene symbol or identifier from the GeneID column of the
            <a href="/geo/download/?format=file&amp;type=rnaseq_counts&amp;file=Human.GRCh38.p13.annot.tsv.gz">Human.GRCh38.p13.annot.tsv.gz</a>
            annotation file. For microarray data, use the identifier from the ID column of the corresponding
            Platform record. Each red bar in the graph represents the expression measurement extracted from
            the <a href="/geo/info/rnaseqcounts.html#norm">TPM normalized</a> expression counts (for RNA-seq),
            or the Value column of the original submitter-supplied Sample record (for microarrays).
            This feature does not perform any calculations; it merely displays the expression values of the gene
            across Samples. Sample groups do not need to be defined for this feature to work.
        </p>

        <a id="r_script"></a>
        <h3>R script</h3>

        <p>
            This tab prints the R script used to perform the calculation.
            This information can be saved and used as a reference for how results were calculated.
        </p>

        <a id="limitations"></a>
        <h2>Limitations and caveats <a title="Back to top" class="arrow" href="#top">Back to top</a></h2>

        <p>
            The GEO database is a public repository that archives thousands of original
            high-throughput functional genomic studies submitted by the scientific community.
            These studies represent a large diversity of experimental types and designs,
            and contain data that are processed and normalized using a wide variety of methods.
            GEO2R can access and analyze almost any GEO Series, regardless of data type and quality,
            so the user must be aware of the following limitations and caveats.
        </p>
        <p>
            <strong>Results may not match publication</strong>:
            RNA-seq data can be processed using many different software packages, parameter settings and filters,
            so the counts and comparisons generated by the
            <a href="/geo/info/rnaseqcounts.html">NCBI RNA-seq counts pipeline</a> and GEO2R may not match results
            in the accompanying publication. The NCBI pipeline represents just one of many possible processing
            approaches. It is likely the original submitter used different procedures to process their data,
            which can lead to somewhat different expression results from those generated by the NCBI pipeline.
        </p>
        <p>
            <strong>Missing Samples</strong>:
            RNA-seq count data may be missing because the run did not pass the 50% alignment rate threshold in the
            <a href="/geo/info/rnaseqcounts.html">NCBI RNA-seq counts pipeline</a>, or because processing failed for a
            technical reason. Microarray values may be missing because the submitter was unable to produce
            data for a given Sample. Regardless of the reason, Samples for which no data are available are
            greyed out in the Sample table and cannot be selected for comparison.
        </p>
        <p>
            <strong>Check that Sample values are comparable</strong>:
            GEO submitters often deposit more than one type of sequence data (e.g., RNA-seq and RIP-seq) in the same study,
            meaning that the RNA counts, even within a matrix, are not directly comparable.
            Other times, although Samples are of the same type, they still may not be intended for comparison.
            Review the original records to determine if all the Samples within a study are intended to be compared
            directly. Similarly, for microarray data, GEO2R operates on Series Matrix files
            which contain data extracted directly from the VALUE column of Sample tables.
            Submitters are asked to supply normalized data in the VALUE column, rendering the Samples cross-comparable.
            The majority of GEO microarray data do conform to this rule. GEO applies no further processing other than to perform a log2 transformation on values
            determined not to be in log space (see <a href="#options">Options</a> section).
            However, some studies, such as dual channel loop design data, may generate values that do not have a
            common reference and are not directly comparable. Some studies may contain Sample value data that are not normalized,
            or have a design such that the Samples were never intended to be directly compared.
            Yet other studies do not have sufficient replicate Samples to perform a robust statistical analysis.
            Users should examine the original Series to understand the experimental design,
            and check the 'Data processing' field or VALUE description in the original Sample records for information on what the values represent.
            Several plots, including the boxplot and expression density plots, can be generated without Sample group selection;
            just click <em>Analyze</em> before defining groups. These plots can help users assess whether the distributions
            of values across Samples are normalized and cross-comparable.
        </p>
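        <p>
            The kind of check a boxplot supports can be approximated numerically. The following Python sketch
            summarizes each Sample by its median and interquartile range and reports whether the medians agree.
            The Sample names, values and tolerance are all hypothetical, and this heuristic is not part of GEO2R;
            it only illustrates what "cross-comparable distributions" means in practice.
        </p>

```python
import statistics


def sample_summary(values):
    """Median and interquartile range of one Sample's values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": median, "iqr": q3 - q1}


def medians_agree(samples, tolerance=1.0):
    """True when all Sample medians fall within `tolerance` of each other."""
    medians = [sample_summary(v)["median"] for v in samples.values()]
    return tolerance >= max(medians) - min(medians)


# Hypothetical Samples: two on a log2-like scale, one that looks unlogged.
samples = {
    "GSM_A": [6.1, 7.0, 7.9, 8.4, 9.2],
    "GSM_B": [6.0, 7.1, 8.0, 8.5, 9.0],
    "GSM_C": [60.0, 130.0, 250.0, 340.0, 590.0],
}
comparable = medians_agree(samples)
```

        <p>
            Here the third Sample has a median far from the other two, so <code>comparable</code> is false.
            A result like this is a prompt to read the 'Data processing' field of the Sample records
            before running a comparison.
        </p>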
        <p>
            <strong>Data type restriction:</strong> (Microarray) GEO2R operates on data in Series Matrix files
            which contain data extracted directly from the VALUE column of Sample tables.
            Some categories of GEO Samples do not have data tables
            (e.g., high-throughput sequencing or genome tiling arrays) and thus cannot be analyzed using GEO2R.
        </p>
        <p>
            <strong>Contrast selection:</strong> When more than two Sample groups are defined,
            GEO2R selects pairwise contrasts in a circular fashion (e.g., 1 vs 2, 2 vs 3, 3 vs 4).
            Thus, the top differentially expressed genes presented in the results table may not
            reflect every pairwise contrast that the user expects.
        </p>
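        <p>
            The gap between the consecutive contrasts in the example above and the full set of pairwise contrasts
            can be sketched as follows. The function and group names are hypothetical. Whether GEO2R also adds a
            closing contrast between the last and first groups is not stated here, so the sketch leaves it out.
        </p>

```python
from itertools import combinations


def consecutive_contrasts(groups):
    """Pair each group with the next one, as in the 1 vs 2, 2 vs 3 example."""
    return list(zip(groups, groups[1:]))


def all_pairwise_contrasts(groups):
    """Every unordered pair of groups."""
    return list(combinations(groups, 2))


# Four hypothetical Sample groups, in the order they were defined.
groups = ["g1", "g2", "g3", "g4"]
selected = consecutive_contrasts(groups)
possible = all_pairwise_contrasts(groups)
missing = [pair for pair in possible if pair not in selected]
```

        <p>
            With four groups, three of the six possible pairwise contrasts (g1 vs g3, g1 vs g4 and g2 vs g4) are
            absent from the consecutive selection, so the order in which groups are defined affects which
            comparisons are made.
        </p>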
        <p>
            <strong>Within-Series restriction:</strong> GEO2R operates on Series Matrix files.
            Thus, analyses are restricted to Samples that occur within one Series; it is not possible to perform cross-Series comparisons.
        </p>
        <p>
            <strong>Failed jobs:</strong> Occasionally, a GEO2R analysis will fail because some aspect of the input data
            is not compatible with the <em>GEOquery</em>, <em>limma</em>, or <em>DESeq2</em> packages.
            In such cases, native Bioconductor errors are reported.
        </p>
        <p>
            <strong>10-minute timeout:</strong> GEO2R currently has a 10-minute cutoff imposed on job processing.
            If the Series you are attempting to analyze has a large number of Samples and/or genes,
            the analysis may not run to completion.
        </p>

        <a id="references"></a>
        <h2>More information and references <a title="Back to top" class="arrow" href="#top">Back to top</a></h2>

        <a id="summary_statistics"></a>
        <h3>Summary statistics</h3>

        <strong>RNA-seq:</strong>

        <p>
            GEO2R provides the following summary statistics as generated by <em>DESeq2</em>. GEO2R uses the
            Wald test for comparing 2 groups of Samples, and LRT (Likelihood Ratio Test) when comparing
            3 or more groups of Samples.
        </p>

        <table class="overview">
            <tbody>
                <tr>
                    <th>padj</th>
                    <td>
                        P-value after adjustment for multiple testing. This column is generally recommended as the primary
                        statistic by which to interpret results.
                    </td>
                </tr>
                <tr>
                    <th>pvalue</th>
                    <td>
                        Raw P-value.
                    </td>
                </tr>
                <tr>
                    <th>lfcSE</th>
                    <td>
                        Standard error of the log2FoldChange estimate (only available when two groups of Samples are defined).
                    </td>
                </tr>
                <tr>
                    <th>stat</th>
                    <td>
                        The Wald statistic (for a two group comparison), or the difference in deviance between the reduced model
                        and the full model (for &gt;2 group comparison).
                    </td>
                </tr>
                <tr>
                    <th>log2FoldChange</th>
                    <td>
                        Log2-fold change between two experimental conditions (only available when two groups of Samples are defined).
                    </td>
                </tr>
                <tr>
                    <th>baseMean</th>
                    <td>
                        The average of the normalized counts taken over all Samples.
                    </td>
                </tr>
            </tbody>
        </table>
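        <p>
            For a two-group comparison, the columns in the table above are related: the Wald statistic is the
            log2 fold change divided by its standard error, and the raw P-value is the two-sided tail probability
            of that statistic under a standard normal distribution. The Python sketch below illustrates this
            relationship with hypothetical numbers; it does not reproduce the <em>DESeq2</em> model fitting that
            produces the estimates.
        </p>

```python
import math


def wald_test(log2_fold_change, lfc_se):
    """Return the Wald statistic and its two-sided P-value (standard normal)."""
    stat = log2_fold_change / lfc_se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    pvalue = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, pvalue


# Hypothetical estimates for one gene.
stat, pvalue = wald_test(log2_fold_change=1.5, lfc_se=0.5)
```

        <p>
            With these values the statistic is 3.0 and the raw P-value is roughly 0.0027. The padj column is then
            obtained by adjusting the raw P-values of all tested genes for multiple testing.
        </p>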

        <strong>Microarray:</strong>

        <p>
            GEO2R provides the following summary statistics as generated by the <em>limma</em> topTable function.
            More information about each statistic is provided in chapter 10 of the
            <a href="http://bioconductor.org/packages/2.6/bioc/vignettes/limma/inst/doc/usersguide.pdf">limma users guide</a>.
        </p>

        <table class="overview">
            <tbody>
                <tr>
                    <th>adj.P.Val</th>
                    <td>
                        P-value after adjustment for multiple testing. This column is generally recommended as the
                        primary statistic by which to interpret results.
                    </td>
                </tr>
                <tr>
                    <th>P.Value</th>
                    <td>
                        Raw P-value.
                    </td>
                </tr>
                <tr>
                    <th>t</th>
                    <td>
                        Moderated t-statistic (only available when two groups of Samples are defined).
                    </td>
                </tr>
                <tr>
                    <th>B</th>
                    <td>
                        B-statistic or log-odds that the gene is differentially expressed (only available when two groups of Samples are defined).
                    </td>
                </tr>
                <tr>
                    <th>logFC</th>
                    <td>
                        Log2-fold change between two experimental conditions (only available when two groups of Samples are defined).
                    </td>
                </tr>
                <tr>
                    <th>F</th>
                    <td>
                        Moderated F-statistic, which combines the t-statistics for all the pairwise comparisons into an overall test of significance for that gene
                        (only available when more than two groups of Samples are defined).
                    </td>
                </tr>
            </tbody>
        </table>

        <a id="general_references"></a>
        <h3>General references</h3>

        <ul class="citation">
            <li>
                Love, M. I., Huber, W., Anders, S.
                Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
                Genome Biol. 2014;15(12):550.
            </li>
            <li>
                Love, M. I., Anders, S., Huber, W.
                R documentation: <a href="https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html">Analyzing RNA-seq data with DESeq2</a>.
            </li>
            <li>
                Smyth, G. K. (2004).
                Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.
                Statistical Applications in Genetics and Molecular Biology, Vol. 3, No. 1, Article 3.
            </li>
            <li>
                Smyth, G. K. (2005).
                Limma: linear models for microarray data.
                In: Bioinformatics and Computational Biology Solutions using R and Bioconductor,
                R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397-420.
            </li>
            <li>
                Davis, S., and Meltzer, P. S. (2007).
                GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.
                <em>Bioinformatics</em> 23(14): 1846-1847.
            </li>
            <li>
                R documentation: <a href="http://rss.acs.unt.edu/Rdoc/library/limma/html/toptable.html">Table of Top Genes from Linear Model Fit</a>
            </li>
        </ul>

        <a id="adjustment_references"></a>
        <h3>Adjustment test references</h3>

        <ul class="citation">
            <li>
                R documentation: <a href="http://rss.acs.unt.edu/Rdoc/library/stats/html/p.adjust.html">Adjust P-values for Multiple Comparisons</a>
            </li>
            <li>
                Benjamini, Y., and Hochberg, Y. (1995).
                Controlling the false discovery rate: a practical and powerful approach to multiple testing.
                <em>Journal of the Royal Statistical Society Series B</em>, 57, 289-300.
            </li>
            <li>
                Benjamini, Y., and Yekutieli, D. (2001).
                The control of the false discovery rate in multiple testing under dependency.
                <em>Annals of Statistics</em> 29, 1165-1188.
            </li>
            <li>
                Holm, S. (1979).
                A simple sequentially rejective multiple test procedure.
                <em>Scandinavian Journal of Statistics</em>, 6, 65-70.
            </li>
            <li>
                Hommel, G. (1988).
                A stagewise rejective multiple test procedure based on a modified Bonferroni test.
                <em>Biometrika</em>, 75, 383-386.
            </li>
            <li>
                Hochberg, Y. (1988).
                A sharper Bonferroni procedure for multiple tests of significance.
                <em>Biometrika</em>, 75, 800-803.
            </li>
            <li>
                Shaffer, J. P. (1995).
                Multiple hypothesis testing.
                <em>Annual Review of Psychology</em>, 46, 561-576.
            </li>
            <li>
                Sarkar, S. (1998).
                Some probability inequalities for ordered MTP2 random variables: a proof of Simes conjecture.
                <em>Annals of Statistics</em>, 26, 494-504.
            </li>
            <li>
                Sarkar, S., and Chang, C. K. (1997).
                Simes' method for multiple hypothesis testing with positively dependent test statistics.
                <em>Journal of the American Statistical Association</em>, 92, 1601-1608.
            </li>
            <li>
                Wright, S. P. (1992).
                Adjusted P-values for simultaneous inference.
                <em>Biometrics</em>, 48, 1005-1013.
            </li>
        </ul>

    </div>
      </div>
      <div id="last_mod">
                        Last modified: July 16, 2024</div>
      <div id="footer">
    <span class="helpbar">|<a href="https://www.nlm.nih.gov"> NLM </a>|<a href="https://www.nih.gov"> NIH </a>|<a href="mailto:geo@ncbi.nlm.nih.gov"> Email GEO </a>|<a href="/geo/info/disclaimer.html"> Disclaimer </a>|<a href="https://www.nlm.nih.gov/accessibility.html"> Accessibility </a>|<a href="https://www.hhs.gov/vulnerability-disclosure-policy/index.html"> HHS Vulnerability Disclosure </a>|
    </span>
</div>
    </div>
    <script type="text/javascript" src="https://www.ncbi.nlm.nih.gov/portal/portal3rc.fcgi/rlib/js/InstrumentOmnitureBaseJS/InstrumentNCBIBaseJS/InstrumentPageStarterJS.js"></script>
  </body>
</html>