<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>SOFT file format and content - GEO - NCBI</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="author" content="geo" />
<meta name="keywords" content="NCBI, national institutes of health, nih, database, archive, central, bioinformatics, biomedicine, geo, gene, expression, omnibus, chips, microarrays, oligonucleotide, array, sage, CGH" />
<meta name="description" content="Gene Expression Omnibus (GEO) is a database repository of high throughput gene expression data and hybridization arrays, chips, microarrays." />
<meta name="ncbiaccordion" content="collapsible: true, active: false" />
<meta name="ncbi_app" content="geo" />
<meta name="ncbi_pdid" content="documentation" />
<meta name="ncbi_page" content="SOFT file format and content" />
<link rel="shortcut icon" href="/geo/img/OmixIconBare.ico" />
<link rel="stylesheet" type="text/css" href="/geo/css/reset.css" />
<link rel="stylesheet" type="text/css" href="/geo/css/nav.css" />
<link rel="stylesheet" type="text/css" href="/geo/css/info.css" />
<script type="text/javascript" src="/core/jig/1.15.10/js/jig.min.js"></script>
<script type="text/javascript" src="/geo/js/dd_menu.js"></script>
<script type="text/javascript" src="/geo/js/info.js"></script>
<script type="text/javascript">
jQuery.getScript("/core/alerts/alerts.js", function () {
galert(['#crumbs_login_bar', 'body &gt; *:nth-child(1)'])
});
</script>
<script type="text/javascript">
var ncbi_startTime = new Date();
</script>
</head>
<body id="info" class="soft">
<div id="all">
<div id="page">
<div id="header">
<div id="ncbi_logo">
<a href="/">
<img src="/geo/img/ncbi_logo.gif" alt="NCBI Logo" />
</a>
</div>
<div id="geo_logo">
<a href="/geo/"><img src="/geo/img/geo_main.gif" alt="GEO Logo" /></a>
</div>
</div>
<div id="nav_bar">
<ul id="geo_nav_bar">
<li><a href="#">GEO Publications</a>
<ul class="sublist">
<li><a href="/geo/info/GEOHandoutFinal.pdf">Handout</a></li>
<li><a href="/pmc/articles/PMC10767856/">NAR 2024 (latest)</a></li>
<li><a href="/pmc/articles/PMC99122/">NAR 2002 (original)</a></li>
<li><a href="/pmc/?term=10767856,4944384,3531084,3341798,3013736,2686538,2270403,1669752,1619900,1619899,539976,99122">All publications</a></li>
</ul>
</li>
<li><a href="/geo/info/faq.html">FAQ</a></li>
<li><a href="/geo/info/MIAME.html" title="Minimum Information About a Microarray Experiment">MIAME</a></li>
<li><a href="mailto:geo@ncbi.nlm.nih.gov">Email GEO</a></li>
</ul>
</div>
<div id="crumbs_login_bar"><a title="NCBI home page" href="/">NCBI</a> »
<a id="curr_page" title="GEO home page" href="/geo/">GEO</a> »
<a title="GEO documentation guide" href="/geo/info/">Info</a> »
<span>SOFT file format and content</span><span id="login_status"><a href="/geo/submitter/" title="Click here to login. You need to do this only if you want to edit the contact information, submit data, see your unreleased data, or work with data already submitted by you. You do not need to login if you are here just to browse through public holdings">Login</a></span></div>
<div id="content">
<a name="top" id="top"></a>
<h1>SOFT file format and content</h1>
<ul class="doc_list">
<li><a href="#overview">Overview</a></li>
<li>
<a href="#format">SOFT format structure and content</a>
<ul>
<li><a href="#attributes">Attribute definitions</a></li>
<li><a href="#ptable">Platform data table content</a></li>
<li><a href="#stable">Sample data table content</a></li>
</ul>
</li>
<li><a href="#examples">SOFT file examples </a></li>
<li><a href="#download">SOFT download</a></li>
</ul>
<a name="overview" id="overview"></a>
<h2>Overview <a class="arrow" title="Back to top" href="#top">Back to top</a></h2>
<p>
Simple Omnibus Format in Text (SOFT) is a simple, line-based, plain-text format that was originally developed
by GEO for data submissions, updates, and downloads. GEO discontinued the use of SOFT format for data
submissions and updates in early 2024, but continues to make all records available for download in SOFT format.
A single SOFT file can hold both data tables and accompanying descriptive information for multiple,
concatenated Platforms (GPL records), Samples (GSM records), and/or Series (GSE records). SOFT files can be
accessed programmatically or opened in common spreadsheet and database applications.
</p>
<p>
This document was originally written to guide submitters on how to construct a SOFT file for the purpose of
submitting to GEO, but now serves only as a guide to users on the structure and content of SOFT files downloaded
from GEO. As such, some of the information provided is no longer applicable (i.e. any information pertaining to
submission requirements or recommendations is not applicable).
</p>
<p>
<a href="#examples">Examples of SOFT files</a> are available to view.
</p>
<a name="format" id="format"></a>
<h2>SOFT format structure and content <a class="arrow" title="Back to top" href="#top">Back to top</a></h2>
<p>The following section explains the components and structure of a SOFT file.</p>
<ul class="geo_doc_list">
<li>
<span>
<b>Line-type characters</b>: SOFT recognizes four types of line.
One of three characters in the first character position of a line indicates
three of the line types, and the absence of all three indicates the fourth line type.
The line-type characters and the line types they indicate are:
</span>
<table class="overview top_margin">
<thead>
<tr><th>Symbol</th><th>Description</th><th>Line type</th></tr>
</thead>
<tbody>
<tr>
<th>^</th><td>caret lines</td><td>entity indicator line</td>
</tr>
<tr>
<th>!</th><td>bang lines</td><td>entity attribute line</td>
</tr>
<tr>
<th>#</th><td>hash lines</td><td>data table header description line</td>
</tr>
<tr>
<th>n/a</th><td>data lines</td><td>data table row</td>
</tr>
</tbody>
</table>
<span>For simplicity, these lines are referred to as caret lines, bang lines, hash lines, and data lines, respectively. </span>
</li>
<li>
<span>
<b>Label-value pairs</b>: Label-value pairs are the generic way that lines are organized.
Data lines are the only line type that is not organized in label-value pairs. Label-value pairs have the form:
</span>
<ul>
<li><span>[line-type character][label] = [value]</span></li>
</ul>
</li>
<li>
<span>
<b>Entity types (caret lines)</b>: The entity type and its unique identifier are given as
a label-value pair on the caret line. The entity's unique ID is any string of characters
that differs from every other entity ID within the document (i.e., locally unique).
As described in the Overview, submitters supply three entity types: PLATFORM, SAMPLE, and SERIES.
</span>
<table class="overview top_margin">
<thead>
<tr><th>Entity type</th><th>Example entity indicator line</th></tr>
</thead>
<tbody>
<tr>
<td>Platform</td><td>^PLATFORM = my_array_name</td>
</tr>
<tr>
<td>Sample</td><td>^SAMPLE = my_sample_name</td>
</tr>
<tr>
<td>Series</td><td>^SERIES = my_series_name</td>
</tr>
</tbody>
</table>
</li>
<li>
<span>
<a name="attributes" id="attributes"></a>
<b>Attributes (bang lines)</b>: Entity attributes are contained in bang lines and
immediately follow caret lines or other bang lines.
<p>The second column in each of the attribute tables below ('Number of allowed labels') indicates how many times each attribute may appear:
<ul>
<li><span>'1' indicates required, only one value allowed</span></li>
<li><span>'1 or more' indicates required, one or more values allowed</span></li>
<li><span>'0 or more' indicates not required, zero or more values allowed</span></li>
</ul>
</p>
<p>
Several Sample attributes end in _ch[n], where [n] indicates the channel number. For example,
!Sample_label_ch2 = Cy3 indicates that Cy3 was the label in channel 2 of a
two-color experiment. If the experiment is single channel, _ch[n] may be omitted from the attribute.
</p>
</span>
</li>
<li>
<span>
<b>Data table header description lines (hash lines)</b>: Data table header descriptions are
contained in hash lines and immediately follow caret lines, bang lines, or other hash lines.
Hash lines take the label-value pair form. Hash lines are used to provide a description of the
headers named in the header line of the data table.
</span>
</li>
</ul>
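<p>
The line-type rules above can be sketched in a few lines of Python. This is a minimal illustration rather than
a GEO-provided parser: the function name <i>classify_line</i> and the sample title are hypothetical, and the
sketch assumes that data lines are tab-delimited and that a bang line may carry no value
(as with !Sample_table_begin).
</p>
<pre>
```python
def classify_line(line):
    """Return (line_type, label, value) for one line of a SOFT file."""
    kinds = {"^": "entity", "!": "attribute", "#": "header_description"}
    kind = kinds.get(line[:1])
    text = line.rstrip("\r\n")
    if kind is None:
        # No line-type character in the first position: a data table row.
        return ("data", None, text.split("\t"))
    # Split on the first '=' only, because a value may itself contain '='.
    label, sep, value = text[1:].partition("=")
    return (kind, label.strip(), value.strip() if sep else None)


# Hypothetical lines, used only to exercise the function.
sample_lines = [
    "^SAMPLE = my_sample_name",
    "!Sample_title = Control, replicate 1",
    "#VALUE = normalized signal",
    "!Sample_table_begin",
    "ID_REF\tVALUE",
    "probe_1\t12.5",
    "!Sample_table_end",
]
for text in sample_lines:
    print(classify_line(text))
```
</pre>
<p>
Splitting on the first equals sign only is a deliberate choice in this sketch, so that values containing
further equals signs are kept whole.
</p>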
<div id="guidelines_tabs" class="jig-ncbitabs">
<ul>
<li><a href="#platform_tab">Platform</a></li>
<li><a href="#sample_tab">Sample</a></li>
<li><a href="#series_tab">Series</a></li>
</ul>
<div id="platform_tab">
<table class="overview">
<thead>
<tr><th>Label</th><th>Number of allowed labels</th><th>Allowed values and constraints</th><th>Content description</th></tr>
</thead>
<tbody>
<tr>
<td>^PLATFORM</td>
<td>1</td>
<td>any, must be unique within local file</td>
<td>Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records.</td>
</tr>
<tr>
<td>!Platform_title</td>
<td>1</td>
<td>string of length 1-120 characters, must be unique within local file and over all previously submitted Platforms for that submitter</td>
<td>Provide a unique title that describes the Platform.</td>
</tr>
<tr>
<td>!Platform_distribution</td>
<td>1</td>
<td>commercial, non-commercial, custom-commercial, or virtual</td>
<td>Microarrays are 'commercial', 'non-commercial', or 'custom-commercial' in accordance with how the array was manufactured. 'Virtual' is used only for high throughput sequencing or RT-PCR data.</td>
</tr>
<tr>
<td>!Platform_technology</td>
<td>1</td>
<td>spotted DNA/cDNA, spotted oligonucleotide, in situ oligonucleotide, antibody, tissue, SARST, RT-PCR, or MPSS</td>
<td>Select the category that best describes the Platform technology.</td>
</tr>
<tr>
<td>!Platform_organism</td>
<td>1 or more</td>
<td>use standard <a href="/Taxonomy/taxonomyhome.html/">NCBI Taxonomy</a> nomenclature</td>
<td>Identify the organism(s) from which the features on the Platform were designed or derived. </td>
</tr>
<tr>
<td>!Platform_manufacturer</td>
<td>1</td>
<td>any</td>
<td>Provide the name of the company, facility or laboratory where the array was manufactured or produced.</td>
</tr>
<tr>
<td>!Platform_manufacture_protocol</td>
<td>1 or more</td>
<td>any</td>
<td>Describe the array manufacture protocol. Include as much detail as possible, e.g., clone/primer set
identification and preparation, strandedness/length, arrayer hardware/software, spotting protocols.
</td>
</tr>
<tr>
<td>!Platform_catalog_number</td>
<td>0 or more</td>
<td>any</td>
<td>Provide the manufacturer catalog number for commercially-available arrays.</td>
</tr>
<tr>
<td>!Platform_web_link</td>
<td>0 or more</td>
<td>valid URL</td>
<td>Specify a Web link that directs users to supplementary information about the array.</td>
</tr>
<tr>
<td>!Platform_support</td>
<td>0 or 1</td>
<td>any</td>
<td>Provide the surface type of the array, e.g., glass, nitrocellulose, nylon, silicon, unknown.</td>
</tr>
<tr>
<td>!Platform_coating</td>
<td>0 or 1</td>
<td>any</td>
<td>Provide the coating of the array, e.g., aminosilane, quartz, polysine, unknown.</td>
</tr>
<tr>
<td>!Platform_description</td>
<td>0 or more</td>
<td>any</td>
<td>Provide any additional descriptive information not captured in another field, e.g.,
array and/or feature physical dimensions, element grid system.</td>
</tr>
<tr>
<td>!Platform_contributor</td>
<td>0 or more</td>
<td>each value in the form 'firstname,middleinitial,lastname' or 'firstname,lastname':
firstname must be at least one character and cannot contain spaces; middleinitial,
if present, is one character; lastname is at least two characters and can contain spaces.</td>
<td>List all people associated with this array design.</td>
</tr>
<tr>
<td>!Platform_pubmed_id</td>
<td>0 or more</td>
<td>an integer</td>
<td>Specify a valid PubMed identifier (PMID) that references a published article that describes the array. </td>
</tr>
<tr>
<td>!Platform_geo_accession</td>
<td>0 or 1</td>
<td>a valid Platform accession number (GPLxxx)</td>
<td></td>
</tr>
<tr>
<td>!Platform_table_begin</td>
<td>1</td>
<td>no content required</td>
<td>Indicates the start of the data table.</td>
</tr>
<tr>
<td>!Platform_table_end</td>
<td>1</td>
<td>no content required</td>
<td>Indicates the end of the data table.</td>
</tr>
</tbody>
</table>
</div>
<div id="sample_tab">
<table class="overview">
<thead>
<tr><th>Label</th><th>Number of allowed labels</th><th>Allowed values and constraints</th><th>Content description</th></tr>
</thead>
<tbody>
<tr>
<td>^SAMPLE</td>
<td>1</td>
<td>any, must be unique within local file</td>
<td>Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records.</td>
</tr>
<tr>
<td>!Sample_title</td>
<td>1</td>
<td>string of length 1-120 characters, must be unique within local file and over all previously submitted Samples for that submitter</td>
<td>Provide a unique title that describes this Sample.</td>
</tr>
<tr>
<td>!Sample_supplementary_file</td>
<td>1 or more</td>
<td>name of supplementary file, or 'none'</td>
<td>Examples of supplementary file types include original Affymetrix CEL and EXP files, and GenePix GPR files.
</td>
</tr>
<tr>
<td>!Sample_table</td>
<td>0 or 1</td>
<td>name of external CHP or tab-delimited file to be used as data table</td>
<td>- Affymetrix CHP file name:<br />
If the processed data are CHP files, reference the CHP file name in this field.
If the manuscript discusses data processed by
RMA or another algorithm, we recommend providing those values in the <a href="#stable">table section</a>.
There is no need to specify the !Sample_platform_id when CHP files are supplied.
</td>
</tr>
<tr>
<td>!Sample_source_name_ch[n]</td>
<td>1 per channel</td>
<td>any</td>
<td>Briefly identify the biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.</td>
</tr>
<tr>
<td>!Sample_organism_ch[n]</td>
<td>1 or more</td>
<td>use standard <a href="/Taxonomy/taxonomyhome.html/">NCBI Taxonomy</a> nomenclature</td>
<td>Identify the organism(s) from which the biological material was derived.</td>
</tr>
<tr>
<td>!Sample_characteristics_ch[n]</td>
<td>1 or more</td>
<td>'Tag: Value' format</td>
<td>
Describe all available characteristics of the biological source, including factors not necessarily under investigation.
Provide in 'Tag: Value' format, where 'Tag' is a type of characteristic (e.g. "gender", "strain", "tissue", "developmental stage", "tumor stage", etc), and 'Value' is the value for each tag (e.g. "female", "129SV", "brain", "embryo", etc).
</td>
</tr>
<tr>
<td>!Sample_biomaterial_provider_ch[n]</td>
<td>0 or more</td>
<td>any</td>
<td>Specify the name of the company, laboratory or person that provided the biological material.</td>
</tr>
<tr>
<td>!Sample_treatment_protocol_ch[n]</td>
<td>0 or more</td>
<td>any</td>
<td>Describe any treatments applied to the biological material prior to extract preparation.</td>
</tr>
<tr>
<td>!Sample_growth_protocol_ch[n]</td>
<td>0 or more</td>
<td>any</td>
<td>Describe the conditions that were used to grow or maintain organisms or cells prior to extract preparation.</td>
</tr>
<tr>
<td>!Sample_molecule_ch[n]</td>
<td>1 per channel</td>
<td>total RNA, polyA RNA, cytoplasmic RNA, nuclear RNA, genomic DNA, protein, or other</td>
<td>Specify the type of molecule that was extracted from the biological material.</td>
</tr>
<tr>
<td>!Sample_extract_protocol_ch[n]</td>
<td>1 or more</td>
<td>any</td>
<td>Describe the protocol used to isolate the extract material.</td>
</tr>
<tr>
<td>!Sample_label_ch[n]</td>
<td>1 per channel</td>
<td>any</td>
<td>Specify the compound used to label the extract e.g., biotin, Cy3, Cy5, 33P.</td>
</tr>
<tr>
<td>!Sample_label_protocol_ch[n]</td>
<td>1 or more</td>
<td>any</td>
<td>Describe the protocol used to label the extract.</td>
</tr>
<tr>
<td>!Sample_hyb_protocol</td>
<td>1 or more</td>
<td>any</td>
<td>Describe the protocols used for hybridization, blocking and washing, and any post-processing steps such as staining.</td>
</tr>
<tr>
<td>!Sample_scan_protocol</td>
<td>1 or more</td>
<td>any</td>
<td>Describe the scanning and image acquisition protocols, hardware, and software.</td>
</tr>
<tr>
<td>!Sample_data_processing</td>
<td>1 or more</td>
<td>any</td>
<td>Provide details of how data in the VALUE column of the table were generated and calculated, i.e., normalization method, data selection procedures and parameters, transformation algorithm (e.g., MAS5.0), and scaling parameters.</td>
</tr>
<tr>
<td>!Sample_description</td>
<td>0 or more</td>
<td>any</td>
<td>Include any additional information not provided in the other fields, or broad descriptions that cannot be easily dissected into the other fields.</td>
</tr>
<tr>
<td>!Sample_platform_id</td>
<td>1</td>
<td>a valid Platform identifier</td>
<td>Reference the Platform upon which this hybridization was performed.</td>
</tr>
<tr>
<td>!Sample_geo_accession</td>
<td>0 or 1</td>
<td>a valid Sample accession number (GSMxxx)</td>
<td></td>
</tr>
<tr>
<td>!Sample_anchor</td>
<td>1</td>
<td>SAGE enzyme anchor, usually NlaIII or Sau3A</td>
<td>Use for SAGE submissions only.</td>
</tr>
<tr>
<td>!Sample_type</td>
<td>1</td>
<td>SAGE</td>
<td>Use for SAGE submissions only.</td>
</tr>
<tr>
<td>!Sample_tag_count</td>
<td>1</td>
<td>sum of tags quantified in SAGE library</td>
<td>Use for SAGE submissions only.</td>
</tr>
<tr>
<td>!Sample_tag_length</td>
<td>1</td>
<td>base pair length of the SAGE tags, excluding anchor sequence</td>
<td>Use for SAGE submissions only.</td>
</tr>
<tr>
<td>!Sample_table_begin</td>
<td>1</td>
<td>no content required</td>
<td>Indicates the start of the data table.</td>
</tr>
<tr>
<td>!Sample_table_end</td>
<td>1</td>
<td>no content required</td>
<td>Indicates the end of the data table.</td>
</tr>
</tbody>
</table>
</div>
<div id="series_tab">
<table class="overview">
<thead>
<tr><th>Label</th><th>Number of allowed labels</th><th>Allowed values and constraints</th><th>Content description</th></tr>
</thead>
<tbody>
<tr>
<td>^SERIES</td>
<td>1</td>
<td>any, must be unique within local file</td>
<td>Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records.</td>
</tr>
<tr>
<td>!Series_title</td>
<td>1</td>
<td>string of length 1-255 characters, must be unique within local file and over all previously submitted Series for that submitter</td>
<td>Provide a unique title that describes the overall study.</td>
</tr>
<tr>
<td>!Series_summary</td>
<td>1 or more</td>
<td>any</td>
<td>Summarize the goals and objectives of this study. The abstract from the associated publication may be suitable.</td>
</tr>
<tr>
<td>!Series_overall_design</td>
<td>1</td>
<td>any</td>
<td>Provide a description of the experimental design. Indicate how many Samples are analyzed, whether replicates are included, and whether there are control and/or reference Samples, dye-swaps, etc.</td>
</tr>
<tr>
<td>!Series_pubmed_id</td>
<td>0 or more</td>
<td>an integer</td>
<td>Specify a valid PubMed identifier (PMID) that references a published article describing this study.
</td>
</tr>
<tr>
<td>!Series_web_link</td>
<td>0 or more</td>
<td>valid URL</td>
<td>Specify a Web link that directs users to supplementary information about the study.</td>
</tr>
<tr>
<td>!Series_contributor</td>
<td>0 or more</td>
<td>each value in the form 'firstname,middleinitial,lastname' or 'firstname,lastname': firstname must be at least one character and cannot contain spaces; middleinitial, if present, is one character; lastname is at least two characters and can contain spaces.</td>
<td>List all people associated with this study.</td>
</tr>
<tr>
<td>!Series_variable_[n]</td>
<td>0 or more</td>
<td>dose, time, tissue, strain, gender, cell line, development stage, age, agent, cell type, infection, isolate, metabolism, shock, stress, temperature, specimen, disease state, protocol, growth protocol, genotype/variation, species, individual, or other</td>
<td>Indicate the variable type(s) investigated in this study, e.g.,<br /> !Series_variable_1 = age <br /> !Series_variable_2 = age <br /> NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.</td>
</tr>
<tr>
<td>!Series_variable_description_[n]</td>
<td>0 or more</td>
<td>any</td>
<td>Describe each variable, e.g.,<br /> !Series_variable_description_1 = 2 months<br /> !Series_variable_description_2 = 12 months<br /> NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.</td>
</tr>
<tr>
<td>!Series_variable_sample_list_[n]</td>
<td>0 or more</td>
<td>each value a valid reference to a ^SAMPLE identifier, or all</td>
<td>List which Samples belong to each group, e.g.,<br /> !Series_variable_sample_list_1 = samA, samB <br /> !Series_variable_sample_list_2 = samC, samD <br /> NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.</td>
</tr>
<tr>
<td>!Series_repeats_[n]</td>
<td>0 or more</td>
<td>biological replicate, technical replicate - extract, or technical replicate - labeled-extract</td>
<td>Indicate the repeat type(s), e.g.,<br /> !Series_repeats_1 = biological replicate <br /> !Series_repeats_2 = biological replicate<br />NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.</td>
</tr>
<tr>
<td>!Series_repeats_sample_list_[n]</td>
<td>0 or more</td>
<td>each value a valid reference to a ^SAMPLE identifier, or all</td>
<td>List which Samples belong to each group, e.g., <br /> !Series_repeats_sample_list_1 = samA, samB<br /> !Series_repeats_sample_list_2 = samC, samD<br />NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.</td>
</tr>
<tr>
<td>!Series_sample_id</td>
<td>1 or more</td>
<td>valid Sample identifiers</td>
<td>Reference the Samples that make up this experiment. Reference the Sample accession numbers (GSMxxx) if the Samples already exist in GEO, or reference the ^SAMPLE identifiers if they are being submitted in the same file.</td>
</tr>
<tr>
<td>!Series_geo_accession</td>
<td>0 or 1</td>
<td>a valid Series accession number (GSExxx)</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</div>
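<p>
Several of the attribute conventions in the tables above lend themselves to small helper functions. The sketch
below is illustrative only: the function names are hypothetical, and it assumes that [n] in a label such as
!Sample_label_ch[n] stands for a plain integer, that 'Tag: Value' is split at the first colon, and that
contributor names follow the 'firstname,middleinitial,lastname' or 'firstname,lastname' forms described for
the contributor attributes.
</p>
<pre>
```python
import re

# Label with an optional _ch[n] suffix, e.g. Sample_label_ch2.
CHANNEL_LABEL = re.compile(r"^(.+?)(?:_ch(\d+))?$")


def split_channel(label):
    """Split 'Sample_label_ch2' into ('Sample_label', 2); channel is None if absent."""
    base, channel = CHANNEL_LABEL.match(label).groups()
    return base, (int(channel) if channel else None)


def parse_characteristic(value):
    """Split a 'Tag: Value' string at the first colon; tag is None if no colon."""
    tag, sep, rest = value.partition(":")
    if not sep:
        return (None, value.strip())
    return (tag.strip(), rest.strip())


def parse_contributor(value):
    """Return (firstname, middleinitial or None, lastname)."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return (parts[0], None, parts[1])
    raise ValueError("unexpected contributor format: " + repr(value))


print(split_channel("Sample_label_ch2"))
print(parse_characteristic("strain: 129SV"))
print(parse_contributor("Jane,Q,Doe"))  # hypothetical name
```
</pre>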
<a name="ptable" id="ptable"></a>
<h2>Platform data table content <a class="arrow" title="Back to top" href="#top">Back to top</a></h2>
<ul class="geo_doc_list">
<li><span>
A Platform data table should lie between the <i>!Platform_table_begin</i> and <i>!Platform_table_end</i> attributes.
</span></li>
<li><span>Data tables must be in plain text (ASCII) tab-delimited format. </span></li>
<li><span>
Each row of the Platform table is represented by its own unique identifier (ID). The ID column provided
in the Platform table corresponds to the ID_REF column used in accompanying Sample data tables -
there should be a 1:1 correspondence. Sample data tables should contain normalized data.
</span></li>
<li><span>
The Platform table must include meaningful, trackable, sequence identifiers (e.g. GenBank/RefSeq accessions,
locus tags, clone IDs, oligo sequences, chromosome locations, etc - see <a href="#headers">table below</a> for full list).
This information enables users to comprehensively interpret the data in compliance with
<a href="/geo/info/MIAME.html">MIAME standards</a>, and allows GEO to retrieve up-to-date
annotation for the Platform when incorporated into our downstream data query tools.
References to in-house databases or top BLAST hits are not sufficient.
</span></li>
</ul>
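<p>
As a rough illustration of the points above, the following Python sketch collects the tab-delimited rows that
lie between the table_begin and table_end attributes of each entity, then checks that every ID_REF in a Sample
table has a matching ID in the Platform table. The function name <i>read_tables</i> and the example records are
hypothetical. The sketch matches the begin and end markers by their suffix and treats quote characters as
ordinary data, both of which are assumptions rather than documented requirements.
</p>
<pre>
```python
import csv
import io


def read_tables(soft_text):
    """Return {entity_id: list of row dicts} for each data table in the text."""
    tables = {}
    entity_id = None
    table_lines = None  # None means we are currently outside a data table
    for line in soft_text.splitlines():
        if line.startswith("^"):
            entity_id = line.partition("=")[2].strip()
        elif line.startswith("!") and line.rstrip().endswith("_table_begin"):
            table_lines = []
        elif line.startswith("!") and line.rstrip().endswith("_table_end"):
            if table_lines is not None:
                handle = io.StringIO("\n".join(table_lines))
                reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
                tables[entity_id] = list(reader)
            table_lines = None
        elif table_lines is not None:
            table_lines.append(line)
    return tables


# Hypothetical records, used only to exercise the function.
example = "\n".join([
    "^PLATFORM = my_array_name",
    "!Platform_table_begin",
    "ID\tGB_ACC",
    "1\tNM_022975.1",
    "2\t",
    "!Platform_table_end",
    "^SAMPLE = my_sample_name",
    "!Sample_table_begin",
    "ID_REF\tVALUE",
    "1\t12.5",
    "2\t7.25",
    "!Sample_table_end",
])
tables = read_tables(example)
platform_ids = {row["ID"] for row in tables["my_array_name"]}
sample_refs = {row["ID_REF"] for row in tables["my_sample_name"]}
# ID_REF values with no matching Platform ID (empty when the tables agree).
unmatched = sample_refs - platform_ids
print(sorted(unmatched))
```
</pre>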
<h3>Standard Platform Headers</h3>
<p>
The first row in the Platform table is a header line that identifies the content of each column.
Column headers may be standard or non-standard. At least one standard column
(other than ID) is supplied with each Platform submission.
</p>
<p>
In addition to these standard columns, the data table may include any number of non-standard columns.
Examples of non-standard columns include array coordinate information, gene symbol or description,
gene ontology terms, quality indicators, etc. Columns may appear in any order after the ID column.
</p>
<p>
Standard column headers and their content are as follows:
</p>
<a name="headers" id="headers"></a>
<table class="overview">
<thead><tr><th>HEADER</th><th>CONTENT</th></tr></thead>
<tbody>
<tr>
<th>ID</th>
<td>
(Required) An identifier that unambiguously identifies
each row on the Platform table. Each ID within a Platform table must be unique.
This column heading should appear first and may be used only once in the data table.
Sample data tables should contain normalized data. If the normalization strategy requires taking the
average of replicate array features, the Platform should reflect the condensed template.
</td>
</tr>
<tr>
<th>SEQUENCE</th>
<td>The nucleotide sequence of each oligo, clone or PCR product.</td>
</tr>
<tr>
<th>GB_ACC</th>
<td>
GenBank accession - identifies a
biological sequence through the GenBank sequence accession number assigned
to the sequence, or the representative GenBank or RefSeq accession number upon
which the sequence was designed. It is recommended to include the version number of the accessions upon which the sequences were designed
(e.g., NM_022975.1 rather than NM_022975).
This is particularly important for RefSeq accessions which are updated frequently.
GenBank accessions representing the top BLAST hits for the sequences are not acceptable. Also,
chromosome, genome and contig accession numbers are generally not acceptable as they are not specific enough
to accurately identify the portion of the sequence printed on arrays (use GB_RANGE instead).
</td>
</tr>
<tr>
<th>GB_LIST</th>
<td>
GenBank accession list - as for GB_ACC, but allows more than one GenBank accession number to be presented. For example,
the sequences may have GenBank accession numbers representing both the 5' and 3' ends of the clones.
Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one GB_ACC column may be supplied.
</td>
</tr>
<tr>
<th>GB_RANGE</th>
<td>
GenBank accession range - specifies a particular sequence position within a GenBank accession number.
Use format ACCESSION.VERSION[start..end]. Useful for tiling arrays.
</td>
</tr>
<tr>
<th>RANGE_GB</th>
<td>
Use format ACCESSION.VERSION. Should be used in conjunction with RANGE_START and RANGE_END. Useful for tiling arrays.
</td>
</tr>
<tr>
<th>RANGE_START</th>
<td>
Use in conjunction with RANGE_GB. Indicates the start position (relative to the RANGE_GB accession). Useful for tiling arrays.
</td>
</tr>
<tr>
<th>RANGE_END</th>
<td>
Use in conjunction with RANGE_GB. Indicates the end position (relative to the RANGE_GB accession). Useful for tiling arrays.
</td>
</tr>
<tr>
<th>RANGE_STRAND</th>
<td>
Use in conjunction with RANGE_GB. Indicates the strand represented. Use + or - or empty. Useful for tiling arrays.
</td>
</tr>
<tr>
<th>GI</th>
<td>
GenBank identifier - as for GB_ACC, but specify the GenBank identifier number rather than the GenBank accession number.
</td>
</tr>
<tr>
<th>GI_LIST</th>
<td>
GenBank identifier list - as for GI, but allows more than one GenBank identifier to be presented.
Multiple GIs should be separated using commas or spaces. Alternatively, more than one GI column may be supplied.
</td>
</tr>
<tr>
<th>GI_RANGE</th>
<td>
GenBank identifier range - specifies a particular sequence position on a GenBank identifier number. Use format GI[start..end].
</td>
</tr>
<tr>
<th>CLONE_ID</th>
<td>
Clone identifier - identifies a biological sequence
through a standard clone identifier. Only CLONE_IDs that can be used to identify
the sequence through an NCBI or other public-database
query should be provided in this column. Examples include FlyBase IDs,
RIKEN clone IDs and IMAGE clone numbers.
</td>
</tr>
<tr>
<th>CLONE_ID_LIST</th>
<td>
CLONE_ID list - as for CLONE_ID, but allows more than one clone identifier to be presented.
Multiple Clone IDs should be separated using commas or spaces. Alternatively, more than one CLONE_ID column may be supplied.
</td>
</tr>
<tr>
<th>ORF</th>
<td>
Open reading frame designator - identifies a biological sequence through an experimentally or
computationally derived open reading frame identifier. The ORF designator is
intended to represent a known or predicted DNA coding region or locus_tag identified
in <a href="/datasets/">NCBI Datasets</a>.
It may be appropriate to include a GENOME_ACC column to reference the GenBank accession from which the ORF names are derived.
</td>
</tr>
<tr>
<th>ORF_LIST</th>
<td>
ORF list - as for ORF, but allows more than one open reading frame designator to be presented.
Multiple ORFs should be separated using commas or spaces. Alternatively, more than one ORF column may be supplied.
</td>
</tr>
<tr>
<th>GENOME_ACC</th>
<td>
Genome accession number - specifies the GenBank or RefSeq genome accession number from which ORF identifiers are derived. It is
important to include the version number of the genome accession upon which the sequences were generated (e.g., NC_004721.1 rather than NC_004721) because updates to the
genome sequence may render the ORF designations incorrect.
</td>
</tr>
<tr>
<th>SNP_ID</th>
<td>
SNP identifier - typically specifies a dbSNP refSNP ID with format rsXXXXXXXX.
</td>
</tr>
<tr>
<th>SNP_ID_LIST</th>
<td>
SNP identifier list - as for SNP_ID, but allows more than one SNP_ID to be presented.
Multiple SNP_IDs should be separated using commas or spaces. Alternatively, more than one SNP_ID column may be supplied.
</td>
</tr>
<tr>
<th>miRNA_ID</th>
<td>
microRNA identifier - typically in a format such as hsa-let-7a or MIRNLET7A2.
</td>
</tr>
<tr>
<th>miRNA_ID_LIST</th>
<td>
microRNA identifier list - as for miRNA_ID, but allows more than one miRNA_ID to be presented.
Multiple miRNA_IDs should be separated using commas or spaces. Alternatively, more than one miRNA_ID column may be supplied.
</td>
</tr>
<tr>
<th>SPOT_ID</th>
<td>
Alternative spot identifier - use only when no identifier or sequence tracking information is available.
This column is useful for designating control and empty features.
</td>
</tr>
<tr>
<th>ORGANISM</th>
<td>
The organism source of each feature on the array.
This is most useful when the array contains sequences derived from multiple organisms.
</td>
</tr>
<tr>
<th>PT_ACC</th>
<td>
Protein accession - identifies any GenBank or RefSeq protein accession number. Protein accession numbers
should only be supplied for protein arrays. Nucleotide accession numbers should be
supplied for nucleotide arrays.
</td>
</tr>
<tr>
<th>PT_LIST</th>
<td>
Protein accession list - as for PT_ACC, but allows more than one protein accession number to be presented.
Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one PT_ACC column may be supplied. Protein accession numbers
should only be supplied for protein arrays. Nucleotide accession numbers should be
supplied for nucleotide arrays.
</td>
</tr>
<tr>
<th>PT_GI</th>
<td>
Protein GenBank or RefSeq identifier. Protein identifiers should only be supplied for protein arrays or proteomic mass
spectrometry Platforms. Nucleotide identifiers should be supplied for nucleotide arrays.
</td>
</tr>
<tr>
<th>PT_GI_LIST</th>
<td>
Protein identifier list - as for PT_GI, but allows more than one protein identifier to be presented.
Multiple identifiers should be separated using commas or spaces. Alternatively, more than one PT_GI column may be supplied. Protein identifiers
should only be supplied for protein arrays. Nucleotide identifiers should be
supplied for nucleotide arrays.
</td>
</tr>
<tr>
<th>SP_ACC</th>
<td>
SwissProt accession. SwissProt accession numbers
should only be supplied for protein arrays. Nucleotide accession numbers should be
supplied for nucleotide arrays.
</td>
</tr>
<tr>
<th>SP_LIST</th>
<td>
SwissProt accession list - as for SP_ACC, but allows more than one SwissProt accession number to be presented.
Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one SP_ACC column may be supplied. SwissProt accession numbers
should only be supplied for protein arrays. Nucleotide accession numbers should be
supplied for nucleotide arrays.
</td>
</tr>
</tbody>
</table>
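<p>
Several of the list-type columns above (for example miRNA_ID_LIST, PT_LIST, PT_GI_LIST and SP_LIST) allow multiple identifiers separated by commas or spaces.
The following is a minimal Python sketch of how a reader of a Platform table might split such a cell into individual identifiers.
The function name <i>split_id_list</i> and the sample identifiers are illustrative assumptions and are not part of the SOFT specification.
</p>

```python
import re


def split_id_list(cell):
    """Split a *_LIST cell on commas and/or whitespace, dropping empty items.

    Hypothetical helper for illustration only; not part of any GEO tool.
    """
    return [item for item in re.split(r"[,\s]+", cell.strip()) if item]


# Mixed comma and space separators are both accepted.
print(split_id_list("hsa-let-7a, hsa-let-7b hsa-let-7c"))
```

<p>
An empty cell yields an empty list, so features without any identifier can be handled without special cases.
</p>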
<a name="stable" id="stable"></a>
<h2>Sample data table content<a class="arrow" title="Back to top" href="#top">Back to top</a></h2>
<ul class="geo_doc_list">
<li><span>
A Sample data table should lie between the <i>!Sample_table_begin</i> and <i>!Sample_table_end</i> attributes
(unless supplying Affymetrix CHP files or external text files, see <a href="#guidelines_tabs" class="open_sample_tab">!Sample_table</a> attribute description).
</span></li>
<li><span>Normalized values should be included in the table.</span></li>
<li><span>
The Sample data table should only contain information that pertains to the quantification measurements. With the exception of the ID information,
no annotation data that can be found on the reference Platform should be included in the Sample record.
</span></li>
</ul>
<h3>Sample data table headers and content</h3>
<p>
The first row of the table must be a header line that identifies the content of each column. The two required columns are listed below.
In addition to the required columns, submitters may supply any number of auxiliary non-standard columns describing,
for example, supporting measurements and calculations, quality evaluations or flags. Columns may appear in any order after the
ID_REF column.
</p>
<ul class="geo_doc_list">
<li><span>
<b>ID_REF</b>: (Required) Identifier reference - the values in this column should match the
unique identifiers given in the identifier (ID) column of the corresponding Platform data table.
</span></li>
<li><span>
<b>VALUE</b>: (Required) These values should be the final, normalized quantification measurements that are comparable across rows and Samples,
and preferably processed as described in any accompanying manuscript.
Values that should be discarded (e.g., background higher than count, or otherwise flagged as 'bad')
should either be left blank or labeled as "null".
<ul>
<li><span>For single channel data, this column should contain normalized (scaled) signal count data.</span></li>
<li><span>For dual channel data, this column should contain normalized log ratio data (preferably test/reference).</span></li>
</ul>
</span>
</li>
</ul>
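<p>
The rules above can be sketched in Python as follows. This is a minimal illustration, not a GEO tool: it assumes the table is tab-delimited and inline
(between <i>!Sample_table_begin</i> and <i>!Sample_table_end</i>), checks that both required columns are present, and converts blank or "null" VALUE entries to None.
The function name <i>parse_sample_table</i> and the example rows are hypothetical.
</p>

```python
def parse_sample_table(lines):
    """Return (header, rows) for the Sample data table in an iterable of SOFT lines.

    Hypothetical helper; assumes a tab-delimited table with numeric VALUE entries.
    """
    header = None
    rows = []
    in_table = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        marker = line.strip().lower()
        if marker == "!sample_table_begin":
            in_table = True
            continue
        if marker == "!sample_table_end":
            break
        if not in_table or not line.strip():
            continue
        fields = line.split("\t")
        if header is None:
            # The first row of the table is the header line.
            header = fields
            missing = [c for c in ("ID_REF", "VALUE") if c not in header]
            if missing:
                raise ValueError("missing required column(s): " + ", ".join(missing))
            continue
        # Pad short rows so every header column has an entry.
        fields = fields + [""] * (len(header) - len(fields))
        row = dict(zip(header, fields))
        value = row["VALUE"].strip()
        # Blank or "null" marks a measurement that should be discarded.
        row["VALUE"] = None if value == "" or value.lower() == "null" else float(value)
        rows.append(row)
    return header, rows


# Made-up example rows; FLAG is an auxiliary non-standard column.
example = [
    "!Sample_table_begin",
    "ID_REF\tVALUE\tFLAG",
    "1\t0.52\tok",
    "2\tnull\tbad",
    "3\t\tbad",
    "!Sample_table_end",
]
header, rows = parse_sample_table(example)
print([row["VALUE"] for row in rows])
```

<p>
A further check, not shown here, would compare each ID_REF against the ID column of the referenced Platform data table.
</p>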
<a name="examples" id="examples"></a>
<h2>SOFT file examples<a class="arrow" title="Back to top" href="#top">Back to top</a></h2>
<p>The following examples (with data tables truncated at 20 rows) are valid GEO SOFT submissions:</p>
<ul class="geo_doc_list">
<li><span>An example of a SOFT file containing <a href="/geo/info/soft_ex_platform.txt">a single Platform submission</a>.</span></li>
<li><span>An example of a SOFT file containing <a href="/geo/info/soft_ex_dual.txt">three dual channel Sample submissions</a>.</span></li>
<li><span>An example of a SOFT file containing <a href="/geo/info/soft_ex_series.txt">a single Series submission</a>.</span></li>
<li><span>An example of a SOFT file containing <a href="/geo/info/soft_ex_family.txt">a family (Platform, Samples and Series) submission</a>.</span></li>
<li><span>An example of a SOFT file containing <a href="/geo/info/soft_ex_affy.txt">three Affymetrix Samples and one Series submission</a>.</span></li>
<li><span>An example of a SOFT file containing <a href="/geo/info/soft_ex_affy_chp.txt">three Affymetrix Samples and one Series submission referencing CHP files</a>.</span></li>
</ul>
<a name="download" id="download"></a>
<h2>SOFT download<a class="arrow" title="Back to top" href="#top">Back to top</a></h2>
<p>
SOFT files produced by batch download contain a few additional attributes, including:
</p>
<p>
_geo_accession<br />
_status<br />
_submission_date<br />
_last_update_date<br />
_row_count<br />
_contact_name<br />
_contact_email<br />
_contact_institute<br />
_contact_department<br />
_contact_city<br />
_contact_phone<br />
_contact_fax<br />
_contact_web_link<br />
Sample_channel_count<br />
Series_type<br />
</p>
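<p>
As a rough sketch, these attributes could be collected from a downloaded file with a few lines of Python.
The example assumes that each attribute appears on a line of the form <i>!label = value</i> and that the labels listed above with a leading underscore carry an entity prefix
(for example <i>Sample_geo_accession</i>); the function name <i>read_attributes</i>, the accession and the other values are placeholders, not real records.
</p>

```python
def read_attributes(lines):
    """Collect '!label = value' lines into a dict mapping each label to a list of values.

    Hypothetical helper; a list is used because a label may occur more than once.
    """
    attributes = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("!") or "=" not in line:
            continue
        label, value = line[1:].split("=", 1)
        attributes.setdefault(label.strip(), []).append(value.strip())
    return attributes


# Placeholder content; the accession and values below are invented for illustration.
example = [
    "^SAMPLE = GSM0000000",
    "!Sample_geo_accession = GSM0000000",
    "!Sample_channel_count = 1",
    "!Sample_row_count = 3",
    "!Sample_table_begin",
]
attrs = read_attributes(example)
print(attrs["Sample_geo_accession"])
```

<p>
Lines without an equals sign, such as the table begin and end markers, are skipped.
</p>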
</div>
</div>
<div id="last_mod">
Last modified: July 16, 2024</div>
<div id="footer">
<span class="helpbar">|<a href="https://www.nlm.nih.gov"> NLM </a>|<a href="https://www.nih.gov"> NIH </a>|<a href="mailto:geo@ncbi.nlm.nih.gov"> Email GEO </a>|<a href="/geo/info/disclaimer.html"> Disclaimer </a>|<a href="https://www.nlm.nih.gov/accessibility.html"> Accessibility </a>|<a href="https://www.hhs.gov/vulnerability-disclosure-policy/index.html"> HHS Vulnerability Disclosure </a>|
</span>
</div>
</div>
<script type="text/javascript" src="https://www.ncbi.nlm.nih.gov/portal/portal3rc.fcgi/rlib/js/InstrumentOmnitureBaseJS/InstrumentNCBIBaseJS/InstrumentPageStarterJS.js"></script>
</body>
</html>