<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv='X-UA-Compatible' content='IE=Edge'>
<meta charset="utf-8">
<meta name='viewport' content='width=device-width, initial-scale=1.0'>
<link rel="shortcut icon" href="//www.niehs.nih.gov/resources/favicons/www/fav-57.png"/>
<link rel="apple-touch-icon" sizes="57x57" href="//www.niehs.nih.gov/resources/favicons/www/fav-57.png">
<link rel="apple-touch-icon" sizes="72x72" href="//www.niehs.nih.gov/resources/favicons/www/fav-72.png">
<link rel="apple-touch-icon" sizes="114x114" href="//www.niehs.nih.gov/resources/favicons/www/fav-114.png">
<link rel="apple-touch-icon" sizes="144x144" href="//www.niehs.nih.gov/resources/favicons/www/fav-144.png">
<title>ORIO | Manual</title>
<link rel='stylesheet' type='text/css'
href='//cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.5/css/bootstrap.min.css'>
<link rel="stylesheet" type="text/css"
href="//cdnjs.cloudflare.com/ajax/libs/font-awesome/4.4.0/css/font-awesome.min.css">
<link rel="stylesheet" type="text/css"
href="//cdnjs.cloudflare.com/ajax/libs/toastr.js/2.1.2/toastr.css">
<link rel="stylesheet" type="text/css" media="all" href="/static/css/site.css">
<script async id="_fed_an_ua_tag" charset="utf-8"
src="/static/js/ufa.min.js?agency=HHS&subagency=NIH"></script>
</head>
<body id="" class="">
<div id='wrap'>
<div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div id="content" class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href='/'>ORIO</a>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li><a href="mailto:orio@niehs.nih.gov?subject=feedback">
<i class="fa fa-fixed fa-envelope-o"></i>&nbsp;Contact us</a></li>
</ul>
<ul class="nav navbar-nav navbar-right">
<li><a href="/quickstart/">Getting started</a></li>
<li><a href="/help/">Help</a></li>
<li><a href="/accounts/login/">Log in</a></li>
</ul>
</div>
</div>
</div>
<div id="content" class="container-fluid">
<div id="mainContent" class="container">
<h2>ORIO help</h2>
<p>
ORIO (Online Resource for Integrative Omics) is an analysis platform for data
from next generation sequencing (NGS). ORIO enables rapid analysis and
integration of NGS data sets. ORIO was designed based on three central
observations:
</p>
<ol>
<li>
Diverse biological phenomena may be represented by discrete positions in
genomic space. For example, protein binding sites represent transcription
factor regulation, and transcription start sites represent transcription initiation.
</li>
<li>
Despite a wide diversity of NGS experiment and data types, analysis of
NGS data often involves consideration and manipulation of genomic read
coverage.
</li>
<li>
Visual inspection remains a critical component of analysis.
</li>
</ol>
<p>
The bulk of analysis is performed using the
<a href="https://github.com/NIEHS/orio">ORIO analysis package</a>
<span class="glyphicon glyphicon-new-window"></span>
. An ORIO
analysis run consists of two steps. First, the intersections between a feature
list of genomic coordinates and a number of NGS data sets are found. Second, the
NGS data sets are correlated based on these intersection values. The output of
these steps may be dynamically visualized using
<a href="https://github.com/NIEHS/orio-web">ORIO-web</a>
<span class="glyphicon glyphicon-new-window"></span>
.
</p>
<img src="/static/img/orio_doc.png">
<p>
ORIO has been published in
<a href="https://doi.org/10.1093/nar/gkx270">Lavender et al. 2017</a>
<span class="glyphicon glyphicon-new-window"></span>
. To cite in your publications:
</p>
<p>
Lavender CA, Shapiro AJ, Burkholder AB, Bennett BD, Adelman K, Fargo DC. ORIO
(Online Resource for Integrative Omics): a web-based platform for rapid
integration of next generation sequencing data. Nucleic Acids Res. 2017 Jun 2;
45 (10): 5678-5690. doi: 10.1093/nar/gkx270.
</p>
<h3>Data intersection</h3>
<p>
The intersection of a feature list is iteratively found for each NGS dataset in
an analysis. This intersection describes the overlap of read coverage from the
NGS data across genomic windows anchored on feature list positions.
</p>
<img src="/static/img/matrix_py_doc.png">
<p>
ORIO focuses its analysis on a user-selected list of genomic coordinates called a
feature list. This feature list may be uploaded as a BED file, or
the user may select from genomic feature lists hosted by ORIO. Analysis is
performed considering genomic windows about each feature. Dimensions of the
windows may be adjusted using the bin start, bin number, and bin size
parameters when setting up an analysis.
</p>
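<p>
As an illustration of how the three window parameters could fit together, the
sketch below lays out the bins for one feature. It assumes that bin start is an
offset relative to the feature position and that bins are contiguous and equally
sized; the function name <code>bin_windows</code> is hypothetical and is not
part of the ORIO package.
</p>

```python
def bin_windows(anchor, bin_start, bin_size, bin_number):
    """Return half-open (start, end) intervals for each bin around one feature.

    Assumption: bin_start is the offset of the first bin relative to the
    feature position (anchor); bins are contiguous and equally sized.
    """
    first = anchor + bin_start
    return [
        (first + i * bin_size, first + (i + 1) * bin_size)
        for i in range(bin_number)
    ]


# A 1,000-bp window centered on position 5000, split into 4 bins of 250 bp.
windows = bin_windows(anchor=5000, bin_start=-500, bin_size=250, bin_number=4)
print(windows)
# [(4500, 4750), (4750, 5000), (5000, 5250), (5250, 5500)]
```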
<p>
ORIO iteratively finds the intersection of selected NGS datasets with the
genomic feature list. The reads intersecting with each feature window are found
for each dataset. Datasets may be uploaded as read coverage bigWig files.
If stranded data is being considered, two separate bigWig files
corresponding to forward and reverse strands may be used. Alternatively, the
user may select from hosted datasets taken from the first production run of
ENCODE.
</p>
<p>
ORIO is able to find data intersections considering strand information. If
strand information is included in the associated BED file, read coverage will be
found respecting the strand of each feature: areas downstream of a feature will
be given higher values while areas upstream will be given lower values. If the
NGS data is stranded (i.e. forward and reverse strand bigWigs are available),
then only coverage on the same strand of a stranded feature will be considered.
</p>
<p>
The product of the data intersection is a two-dimensional matrix, where each row
corresponds to a genomic feature and each column corresponds to a bin of the
genomic window. The user can download these matrices through the 'Download zip'
button on an analysis page, which provides access to all data relevant to an
analysis. Matrices generated in the data
intersection step are then used in the correlative analysis step.
</p>
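<p>
The shape of the intersection matrix can be sketched as follows. This is a toy
illustration, not the ORIO implementation: coverage is held in a plain
dictionary standing in for a bigWig file, and it is assumed that bins of a
minus-strand feature are mirrored about the feature position so that columns
always run from upstream to downstream. The name
<code>coverage_matrix</code> is hypothetical.
</p>

```python
def coverage_matrix(features, coverage, bin_start, bin_size, bin_number):
    """Build a features-by-bins matrix of summed read coverage.

    features: list of (anchor, strand) tuples, strand being "+" or "-".
    coverage: dict mapping position -> coverage (toy stand-in for a bigWig).
    """
    matrix = []
    for anchor, strand in features:
        row = []
        for i in range(bin_number):
            offset = bin_start + i * bin_size
            if strand == "-":
                # Assumption: mirror the bin about the anchor so that
                # columns run upstream -> downstream on either strand.
                positions = range(anchor - offset - bin_size + 1,
                                  anchor - offset + 1)
            else:
                positions = range(anchor + offset, anchor + offset + bin_size)
            row.append(sum(coverage.get(p, 0.0) for p in positions))
        matrix.append(row)
    return matrix


# Coverage of 1.0 at positions 100..109; one feature per strand at 100.
toy_coverage = {p: 1.0 for p in range(100, 110)}
toy_features = [(100, "+"), (100, "-")]
print(coverage_matrix(toy_features, toy_coverage,
                      bin_start=-10, bin_size=10, bin_number=2))
```

Each row corresponds to one feature and each column to one bin, matching the
matrix layout described above.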
<h3>Correlative analysis</h3>
<p>
Using matrices generated in the data intersection step, ORIO then performs
correlative analysis based on compiled read coverage values. NGS datasets and
genomic features are grouped by hierarchical clustering and k-means clustering,
respectively. Associations discovered through clustering can point to
coordination of biological functions.
</p>
<img src="/static/img/matrixByMatrix_py_doc.png">
<p>
For each NGS dataset, there is a matrix of coverage values for each genomic
feature in an analysis. For each dataset pair, the Spearman correlation value is
found considering coverage values at each feature; the coverage value used is
the sum of coverage across all bins in a genomic window. Hierarchical clustering
is performed considering Spearman rho values as the pairwise distance metric.
</p>
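<p>
The pairwise correlation step can be sketched in plain Python. The sketch
assumes each dataset is available as a list of per-feature rows of bin
coverage, sums each row, and computes Spearman rho as the Pearson correlation
of average ranks. Function names are hypothetical, and the hierarchical
clustering itself is not shown.
</p>

```python
import math


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


def dataset_rho(matrix_a, matrix_b):
    """Spearman rho between two datasets using per-feature window sums."""
    return spearman([sum(r) for r in matrix_a], [sum(r) for r in matrix_b])


# Two toy datasets over the same three features (rows) and two bins (columns).
a = [[1, 2], [3, 4], [5, 6]]
b = [[0, 1], [2, 2], [9, 9]]
print(round(dataset_rho(a, b), 6))
```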
<p>
To cluster genomic features, the total read coverage in a genomic window for
each NGS dataset is concatenated to give a one-dimensional data vector for each
feature in an analysis. These vectors are normalized by the variance in each
dataset. For each pair of features, the Euclidean distance is found considering
these normalized data vectors. k-means clustering is performed using these
distances, once for each k value from 2 to 10. Cluster assignments for each k
are saved for later display.
</p>
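<p>
A minimal sketch of the feature clustering described above follows. It reads
"normalized by the variance" as scaling each dataset to unit variance, which is
an assumption, and uses a plain Lloyd's-algorithm k-means with centroids
initialized from randomly chosen data vectors. All function names and the
<code>seed</code> parameter are hypothetical.
</p>

```python
import math
import random
import statistics


def normalize_columns(rows):
    """Scale each dataset (column) to unit variance; one reading of the text."""
    columns = list(zip(*rows))
    scales = [statistics.pstdev(c) or 1.0 for c in columns]  # guard zero spread
    return [[v / s for v, s in zip(row, scales)] for row in rows]


def kmeans(vectors, k, rng, max_iter=100):
    """Plain Lloyd's k-means; returns one cluster label per vector."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_features(rows, k_values=range(2, 11), seed=0):
    """Run k-means for each k, skipping k larger than the number of features."""
    vectors = normalize_columns(rows)
    rng = random.Random(seed)
    return {k: kmeans(vectors, k, rng) for k in k_values if k <= len(vectors)}


# Rows are features; columns are total window coverage in two datasets.
rows = [[1, 10], [2, 20], [3, 30], [100, 1000], [101, 1010], [102, 1020]]
results = cluster_features(rows)
print(results[2])  # labels for k = 2
```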
<p>
Though read coverage is informative for many genomics experiments, in some NGS
experiments specialized analytical techniques must be applied to read coverage
in order to generate useful data metrics. Also, many non-NGS approaches are
relevant for genomics analysis. Acknowledging this, ORIO allows the user to
provide a single data value for each genomic feature to be used in correlative
analysis of independent NGS datasets. We call this data set the sort vector. A
sort vector may be provided at the onset of analysis in the form of a two-column
tab-delimited text file where the first column contains feature names and the
second contains data values.
</p>
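<p>
For illustration, a sort vector in the two-column format described above could
be read as follows. The sketch assumes the file has no header row and that
every value parses as a number; the function name
<code>parse_sort_vector</code> is hypothetical.
</p>

```python
import csv
import io


def parse_sort_vector(text):
    """Parse two-column, tab-delimited text into {feature_name: value}."""
    values = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        name, value = row
        values[name] = float(value)
    return values


example = "geneA\t12.5\ngeneB\t-3\n"
print(parse_sort_vector(example))
# {'geneA': 12.5, 'geneB': -3.0}
```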
<p>
If a sort vector is used, hierarchical clustering is centered on the
sort vector. Read coverage values for each NGS dataset are correlated with data
values in the sort vector by Spearman test. These correlation values are found
for read coverages in each genomic window bin. For each dataset, correlation
values for each bin are concatenated into a one-dimensional vector. For each
dataset pair, the Euclidean distance between these data vectors is found, and
the Euclidean distance is used as the distance metric in hierarchical
clustering. k-means clustering is performed in the same way in analyses with and
without a sort vector.
</p>
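<p>
The distance used when a sort vector is present can be sketched as below. For
brevity the sketch computes Spearman rho with the closed-form rank-difference
formula, which is valid only when there are no tied values; this simplification
and the function names are assumptions made for the example.
</p>

```python
import math


def spearman_no_ties(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(x)
    rank_x = {v: r for r, v in enumerate(sorted(x))}
    rank_y = {v: r for r, v in enumerate(sorted(y))}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))


def bin_correlations(matrix, sort_vector):
    """One rho per bin: that bin's coverage across features vs. the sort vector."""
    return [spearman_no_ties(list(col), sort_vector) for col in zip(*matrix)]


def dataset_distance(matrix_a, matrix_b, sort_vector):
    """Euclidean distance between two datasets' per-bin correlation vectors."""
    return math.dist(bin_correlations(matrix_a, sort_vector),
                     bin_correlations(matrix_b, sort_vector))


sort_vector = [1.0, 2.0, 3.0, 4.0]            # one value per feature
a = [[1, 8], [2, 6], [3, 4], [4, 2]]          # 4 features x 2 bins
b = [[1, 1], [2, 2], [3, 3], [4, 4]]
print(bin_correlations(a, sort_vector))       # [1.0, -1.0]
print(dataset_distance(a, b, sort_vector))    # 2.0
```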
<p>
Correlative analysis results are stored for access and display by the web
application ORIO-web.
</p>
<h3>Data management and display of results</h3>
<p>
ORIO-web is a web application designed to maintain and organize data for
analysis by ORIO. ORIO-web also provides dynamic visualization of ORIO results.
Together ORIO and ORIO-web allow for fast, flexible, and informative integration
of whole-genome data with an intuitive web interface.
</p>
<h3>Account management</h3>
<p>
The ORIO-web landing page asks a user to generate an account associated with an
email address. All data and analyses managed by ORIO-web are associated with a
user account. Most data is privately associated with a user account; however,
ORIO-web does allow individual analyses to be designated as public, allowing for
rapid sharing of results by URL.
</p>
<h3>Data management</h3>
<p>
ORIO-web manages inputs for the ORIO analysis package. Feature lists, NGS data
sets, and sort vectors are associated with a given user account.
</p>
<p>
Data management controls are found by clicking on the 'Manage data' link button.
On the 'Data management' page, headers designate the 'Feature lists', 'Sort
vectors', and 'User dataset' sections. Data may be deleted or modified by
clicking on entries under each header, or new entries may be created by clicking
on 'Create new' buttons.
</p>
<p>
When creating new entries, each data type requires a name, an associated genome
assembly, and a correctly formatted data set. Feature lists may be specified as
stranded; if so, strand must be specified for each entry in the associated BED
file in the sixth column. Sort vectors must be associated with an existing
feature list, and that feature list must be specified upon creation.
</p>
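<p>
As a sketch of the strand requirement, the function below reads one BED line
and, for a stranded feature list, checks that the sixth column holds a strand.
It assumes tab-separated fields in the standard BED column order (chrom, start,
end, name, score, strand); the function name is hypothetical and this is not
the validation ORIO-web itself performs.
</p>

```python
def parse_bed_line(line, stranded):
    """Return (chrom, start, end, strand) from one tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    strand = None
    if stranded:
        # Stranded feature lists must carry '+' or '-' in the sixth column.
        if len(fields) < 6 or fields[5] not in ("+", "-"):
            raise ValueError("stranded feature lists need '+' or '-' in column 6")
        strand = fields[5]
    return chrom, start, end, strand


print(parse_bed_line("chr1\t1000\t1001\ttss_1\t0\t-\n", stranded=True))
# ('chr1', 1000, 1001, '-')
```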
<p>
NGS data sets are uploaded to the tool as read coverage bigWig files. Given the
large size of these files, we require these files to be hosted by the user and be
publicly accessible by HTTP download. When creating a data set entry, the user
must provide a valid URL for HTTP access.
</p>
<h3>Analysis management</h3>
<p>
Completed and pending analyses are presented on the ORIO-web dashboard.
</p>
<ul>
<li>
<b>Create analysis.</b> An analysis can be created by clicking on the
'Proceed to run setup' button. An analysis requires a name, genome
assembly, and feature list. Upon selecting a genome assembly, drop-down
menus for feature list, sort vector, and user-uploaded data sets are
populated. Also upon selecting a genome assembly, ENCODE data selection
fields will be populated. ENCODE data selection was designed to navigate
through the diverse data generated by the ENCODE project. Fields such as
'Data type', 'Cell type', and 'Antibody' may be used to quickly filter
down all ENCODE data sets to a list passing filter criteria. The user
may then select individual data from this filter list.
</li>
<li>
<b>Execute analysis.</b> After all fields and options are specified, the
analysis may be saved. Upon saving, the analysis will be subject to a
validation step. Following validation, the analysis may be started by
clicking the 'Execute' button on the analysis page. Upon completion of
analysis, a message will be sent by email to the user.
</li>
<li>
<b>Modify existing analysis.</b> An analysis may be modified from the
dashboard by clicking on a completed or pending analysis and selecting
'Update' from the 'Actions' drop-down on the analysis page. From there,
the analysis parameters may be modified. An analysis may also be deleted
by selecting 'Delete' from the 'Actions' drop-down menu and confirming
the selection.
</li>
</ul>
<h3>Analysis visualization</h3>
<p>
ORIO-web provides an intuitive interface for investigating analysis results. The
visualization interface may be accessed for a completed analysis by selecting
that analysis on the dashboard and clicking 'View visualization' on the analysis
page. The results of an ORIO analysis may also be downloaded as a zip file by
selecting 'Download zip' from the 'Actions' drop-down on an analysis page.
</p>
<h4>Dataset clustering, without a sort vector.</h4>
<ul>
<li>
Data sets were hierarchically clustered based on Spearman rho values.
Clustering results are shown as a dendrogram on the left side of the top
panel. Rho values are reported by color in an n-by-n heatmap, where n is
the number of data sets. Rho values may also be found in tooltips when
hovering over individual cells. By clicking on a cell, a scatterplot
will be generated showing the points used to derive the Spearman rho
value. A drop-down menu allows for individual values to be investigated
on a bin-by-bin basis.
</li>
<li>
In the bottom panel, individual data sets may be selected in the list on
the left. Once selected, the bar plot on the right will be populated
with pairwise Spearman correlation values for each other data set. After
clicking on 'Display individual heatmap', a window will pop up detailing
the read coverage for that data set over the feature list.
</li>
<li>
In the pop-up, a heatmap of read coverage over the user-specified
genomic window is shown on the right. In the upper-left panel, a plot of
bin-average read coverage is shown. In the mid-left panel, a plot of
bin-average read coverage over quartiles is shown. Quartiles are
generated respecting the sort order of the read coverage heatmap. The
sort order of the heatmap may be changed using the lower-left panel. After
selecting a data set and clicking 'Reorder heatmap', the heatmap will be
re-ordered to reflect read coverage of the selected data set in
descending order, i.e. genomic features with greater read coverage in
the selected data set will be on top. The quartile plot will change upon
re-ordering of the read coverage heatmap. The p-value in the upper-left
corner of the quartile plot is derived from application of the
four-sample Anderson-Darling test to the quartile plots and tests the
null hypothesis that quartiles are sampled from identical populations.
</li>
</ul>
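<p>
The reordering and quartile averaging described for the pop-up can be sketched
as follows. The sketch assumes quartiles are formed by cutting the ordered rows
into four consecutive groups of near-equal size; the function names are
hypothetical and the Anderson-Darling test is not shown.
</p>

```python
def reorder_rows(matrix, sort_values):
    """Order heatmap rows by a per-feature value, largest first."""
    order = sorted(range(len(matrix)), key=lambda i: sort_values[i], reverse=True)
    return [matrix[i] for i in order]


def quartile_bin_averages(matrix):
    """Cut ordered rows into four consecutive groups; average each bin per group."""
    n = len(matrix)
    bounds = [n * q // 4 for q in range(5)]
    averages = []
    for lo, hi in zip(bounds, bounds[1:]):
        rows = matrix[lo:hi]
        # An empty group (fewer than four rows overall) yields an empty list.
        averages.append([sum(col) / len(rows) for col in zip(*rows)] if rows else [])
    return averages


# Eight features x two bins, already in descending order of coverage.
heatmap = [[8, 8], [6, 6], [4, 4], [4, 2], [2, 2], [2, 0], [0, 0], [0, 0]]
print(quartile_bin_averages(heatmap))
# [[7.0, 7.0], [4.0, 3.0], [2.0, 1.0], [0.0, 0.0]]
```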
<h4>Dataset clustering, with a sort vector.</h4>
<ul>
<li>
Data sets were hierarchically clustered. For each data set, the read
coverage sum across each bin is found. Then, for each given bin, the
Spearman rho value is found between the bin read coverage sums and the
sort vector. For each data set, these correlation values are
concatenated in a single data vector. The data sets are hierarchically
clustered using the pairwise Euclidean distance between each data set.
Rho values are displayed by color gradient in an n-by-m heatmap, where n
is the number of data sets and m is the number of genomic bins. By
clicking on a cell, a scatterplot will be generated showing the points
used to derive the Spearman rho value.
</li>
<li>
In the bottom panel, individual data sets may be selected in the list on
the left. Once selected, the bar plot on the right will be populated
with Spearman correlation values for each genomic bin. After clicking on
'Display individual heatmap', a window will pop up detailing the read
coverage for that data set over the feature list.
</li>
<li>
In the pop-up, a heatmap of read coverage over the user-specified
genomic window is shown on the right. In the upper-left panel, a plot of
bin-average read coverage is shown. In the mid-left panel, a plot of
bin-average read coverage over quartiles is shown. Quartiles are
generated respecting the sort order of the read coverage heatmap. The
sort order of the heatmap may be changed using the lower-left panel. After
selecting a data set and clicking 'Reorder heatmap', the heatmap will be
re-ordered to reflect read coverage of the selected data set in
descending order, i.e. genomic features with greater read coverage in
the selected data set will be on top. The quartile plot will change upon
re-ordering of the read coverage heatmap. The p-value in the upper-left
corner of the quartile plot is derived from application of the
four-sample Anderson-Darling test to the quartile plots and tests the
null hypothesis that quartiles are sampled from identical populations.
</li>
</ul>
<h4>Feature clustering.</h4>
<ul>
<li>
Genomic features are clustered using k-means clustering. For each
genomic feature, the sum of the read coverage for each data set is found.
These sums are then normalized such that each value is expressed in units
of variance. These normalized sums are concatenated into data vectors for
each genomic feature. k-means clustering is then performed on these data
vectors. Centroids are initialized by randomly selecting individual data
vectors. k-means clustering is iteratively performed for k values 2 to
10.
</li>
<li>
In the 'Feature clustering' view, clustering results are shown on the
heatmap in the right panel. Here, each row corresponds to a genomic
feature, and each column corresponds to a data set. In each cell, the
color represents the read coverage at a genomic feature for a data set
after upper-quartile normalization. Columns are ordered based on
hierarchical clustering results with a dendrogram at the top of the
panel. Bars on the left side of the panel reflect cluster membership.
</li>
<li>
In the left panel, k values may be selected from a drop-down list.
Members of the selected cluster are displayed in a list at the bottom of
the left panel. If selected, a genomic feature will be indicated in the
heatmap by a black arrow. Also, the values of the selected genomic
feature will be displayed on the centroid chart in the bottom panel.
</li>
<li>
In the bottom panel, a two-dimensional plot displays read coverage
values for cluster centroids. Values are upper-quartile normalized. If a
genomic feature is selected in the upper panel, read coverage values for
that feature will be plotted as a black line.
</li>
</ul>
</div>
</div>
</div>
<footer class="footer">
<ul id="footer-links">
<li>
<a href="http://www.niehs.nih.gov/" target="_blank">NIEHS</a></li>
<li>
<a href="http://www.niehs.nih.gov/about/od/ocpl/policies/" target="_blank">Web Policies</a></li>
<li>
<a href="http://www.niehs.nih.gov/about/od/ocpl/foia" target="_blank">Freedom of Information Act</a></li>
<li>
<a href="http://oig.hhs.gov/" target="_blank">Inspector General</a></li>
</ul>
<div id="footer-logos">
<a href="http://www.usa.gov/" target="_blank">
<img src="/static/img/usagov.png"
alt="USA.gov is the U.S. government's official web portal to all federal, state, and local government web resources and services"
title="USA.gov: Government Made Easy"></a>
<a href="http://www.hhs.gov/" target="_blank">
<img src="/static/img/hhs.png"
alt="U.S. Department of Health and Human Services"
title="U.S. Department of Health and Human Services"></a>
<a href="http://www.nih.gov/" target="_blank">
<img src="/static/img/nihMasterLogo.png"
alt="U.S. National Institutes of Health"
title="U.S. National Institutes of Health"></a>
</div>
</footer>
<script charset="utf-8"
src="//cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
<script charset="utf-8"
src="//cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.3.5/js/bootstrap.min.js"></script>
<script charset="utf-8"
src="//cdnjs.cloudflare.com/ajax/libs/toastr.js/2.1.2/toastr.min.js"></script>
<script charset="utf-8" src="/static/js/site.js"></script>
<script type="text/javascript">
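// Keep notifications on screen until the user dismisses them.
toastr.options = {
    closeButton: true,
    newestOnTop: true,
    positionClass: 'toast-top-right',
    showDuration: 500,
    hideDuration: 500,
    timeOut: 0,          // 0 disables auto-hide
    extendedTimeOut: 0,
};
// Poll the server for new dashboard messages once a minute.
window.setInterval(function(){
    $.get('/dashboard/poll-messages/', function(d){
        // Clear stale notifications when new messages arrive.
        if(d.messages.length>0){
            toastr.clear();
        }
        // resp.status is expected to name a toastr method such as 'success' or 'error'.
        d.messages.forEach(function(resp){
            toastr[resp.status](resp.message);
        });
    });
}, 60000);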
</script>
</body>
</html>