<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="generator">
<title>NCBI News:Summer 2004|eUtils</title>
<style type="text/css">
<!--
a:hover { color: #993300; text-decoration: underline; }
-->
</style>
<link rel="stylesheet" href="ncbinews.css" type="text/css">
<script language="JavaScript" type="text/JavaScript">
<!--
function MM_goToURL() { //v3.0
var i, args=MM_goToURL.arguments; document.MM_returnValue = false;
for (i=0; i<(args.length-1); i+=2) eval(args[i]+".location='"+args[i+1]+"'");
}
//-->
</script>
<style type="text/css">
<!--
.style1 {color: #333333}
-->
</style>
</head>
<body background="images/bckgrnd.gif" bgcolor="white" link="#003399" alink="#CC6600" vlink="#003399" text="black" leftmargin="5" topmargin="5" marginwidth="5" marginheight="5">
<span class="heads"></span> <span class="subheads"></span>
<table border="0" cellpadding="0" cellspacing="0" valign="left" class="tables">
<!--DWLayoutTable-->
<tr height="176">
<td height="176" colspan="2" valign="left" align="left"><img height="12" width="8" src="images/dotclear.gif" alt=""><a href="http://www.ncbi.nlm.nih.gov"><img src="images/logo.gif" alt="NCBI Logo" width="173" height="171" border="0"></a></td>
<td height="176" valign="top" width="10" align="left"></td>
<td height="176" valign="top" colspan="2"><img height="80" width="364" src="images/msthd1.gif" border="0" alt="NCBI News" usemap="#E">
<map name="E">
<area href="http://www.ncbi.nlm.nih.gov/About/newsletter.html" coords="1,17,362,72" shape="rect" alt="NCBI News banner" title="NCBI News Masthead">
</map>
<br>
<table width="488" border="0" cellspacing="0" cellpadding="0">
<tr valign="top">
<td width="380" height="86"><img height="80" width="340" src="images/msthd1b.gif" border="0" alt="National Center for Biotechnology Information" usemap="#NCBI" vspace="3">
<map name="NCBI">
<area href="http://www.dhhs.gov" coords="0,60,221,74" shape="rect" alt="US Department of Health and Human Services" title="US Department of Health and Human Services">
<area href="http://www.ncbi.nlm.nih.gov" coords="0,6,268,21" shape="rect" alt="National Center for Biotechnology Information" title="National Center for Biotechnology Information">
<area href="http://www.nlm.nih.gov" coords="0,25,147,37" shape="rect" alt="National Library of Medicine" title="National Library of Medicine">
<area href="http://www.nih.gov" coords="0,42,147,56" shape="rect" alt="National Institutes of Health" title="National Institutes of Health">
</map>
</td>
<td width="108" height="86">
<div align="right"><img src="images/edition1.gif" alt="Summer 2004 issue of NCBI News" width="100" height="80"></div>
</td>
</tr>
</table>
</td>
</tr>
<tr valign="top">
<td width="13" rowspan="4" align="left" valign="top"><img height="10" width="13" src="images/dotclear.gif" alt=" "></td>
<td width="173" rowspan="4" align="left" valign="top">
<table border="0" cellpadding="0" cellspacing="0" width="120" valign="left" name="Navigation">
<tr height="36">
<td width="130" height="36" valign="top"><br>
<a href="http://www.ncbi.nlm.nih.gov/About/newsletter.html"><img src="images/pastissue.gif" alt="click to go to index of past issues" width="120" height="33" border="0"></a><br>
<br>
<img height="33" width="120" alt="In this issue" src="images/issue.gif"><br>
<br>
<span class="links2"><br>
</span>
<span class="navon">Entrez Programming Utilities (E-Utils)</span><br>
<br>
<a href="pubchem.html" class="navoff">PubChem</a><br>
<br>
<a href="geneplot.html" class="navoff">GenePlot</a><br>
<br>
<a href="nlmcat.html" class="navoff">New NLM Catalog in Entrez</a><br>
<br>
<a href="newbuilds.html" class="navoff">New Genome Builds</a><br>
<br>
<a href="recent.html" class="navoff">New Microbial Genomes in GenBank</a><br>
<br>
<a href="wgs.html" class="navoff">Whole Genome Shotgun Project</a><br>
<br>
<a href="wblast.html" class="navoff">Web BLAST</a><br>
<br>
<a href="traceA.html" class="navoff">Trace Archive Grows</a><br>
<br>
<a href="unigene.html" class="navoff">New Organisms in UniGene</a><br>
<br>
<a href="refseq.html" class="navoff">RefSeq Version 8</a><br>
<br>
<a href="submissions.html" class="navoff">Submissions Corner</a><br>
<br>
<a href="defline.html" class="navoff">Predicted Records</a><br>
<br>
<a href="GBrel.html" class="navoff">GenBank Release 144</a><br>
<br>
<a href="b2210.html" class="navoff">BLAST 2.2.10</a><br>
<br>
<a href="pubs.html" class="navoff">Publications</a>
<br>
<br><a href="masthead.html" class="navoff">Masthead</a> </td>
</tr>
</table> <p>&nbsp;</p></td>
<td height="2149" valign="left" bordercolor="003399"></td>
<td width="490" valign="top">
<div valign="left">
<p><br>
<br>
<br>
<br>
<span class="headlines">Entrez Programming Utilities</span><span class="bodycopy" width="488"><br>
<br>
In dealing with specialized datasets, researchers are often restricted to one of two unattractive choices: either to download an FTP archive that contains far more than the data of interest and then parse it locally, or to access the data interactively, even though the volume of data may make this method cumbersome. To help with the latter method, NCBI provides a suite of programs called the Entrez Programming Utilities (E-Utilities) that allow automated access to the Entrez databases.<br><br>
</span><span class="bodytext3" width="488">What Are the E-Utilities?</span><span class="bodycopy" width="488"><br><br>
The E-Utilities are a set of seven server-side programs that provide a stable interface to the search, retrieval, and linking functions of the Entrez system, using a fixed URL syntax. The output provided by the E-Utilities is in XML format, with the notable exception of the EFetch utility, which returns data in a variety of formats. The E-Utilities are designed to be called from within a computer program that can process their output. Calling an E-Utility from any of the common programming languages, including Perl, Python, and Java, is a simple matter of posting a URL.<br><br>
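As a minimal sketch of what posting a URL can look like in Python, the function below assembles an E-Utility URL from the base URL given in this article and a dictionary of parameters. The script name (the lowercase utility name plus ".fcgi") follows the E-Utilities online documentation; the helper name build_eutils_url and the example query are illustrative assumptions, and nothing is sent over the network.<br><br>
</span>
<pre class="bodytext2">
```python
from urllib.parse import urlencode

# Base URL as listed in this article.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(utility, params):
    """Return the URL for one E-Utility call (hypothetical helper).

    `utility` is a name such as "esearch"; `params` maps parameter
    names to values and is URL-encoded here.
    """
    return EUTILS_BASE + "/" + utility + ".fcgi?" + urlencode(params)


if __name__ == "__main__":
    # Illustrative query only; the search term is an assumption.
    print(build_eutils_url("esearch", {"db": "protein", "term": "interleukin 22"}))
```
</pre>
<span class="bodycopy" width="488">The resulting URL can then be passed to any HTTP client, for example urllib.request.urlopen in the Python standard library.<br><br>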
</span><span class="bodytext3" width="488">The E-Utilities Implement Entrez Functions</span><span class="bodycopy" width="488"><br><br>
Each of the E-Utilities performs a basic task within the Entrez system, and six of them have a direct equivalent in interactive Entrez (Box 1). For instance, typing a text query into the NCBI home page and clicking &ldquo;Go&rdquo; causes Entrez to search for matches across all Entrez databases and list the number of matching records for each. This &ldquo;Global Query&rdquo; function is implemented by EGQuery. If a single database is queried, Entrez first maps the query to a set of integers, or unique identifiers (UIDs), for matching records in the selected database. Entrez UIDs are sometimes referred to as GI numbers for nucleotide and protein, PMIDs for PubMed, and MMDB-IDs for Structures. Entrez queries and the resulting lists of matching UIDs are implemented by ESearch. On the Web, Entrez searches are automatically followed by displays of brief record listings, called Document Summaries (DocSums), for matching records. This functionality is implemented by ESummary. On the Web, a full record in an Entrez database is displayed by clicking on the accession of a DocSum. This function is implemented by EFetch. Accessing records linked to a given record on the Web is as simple as clicking on a link in the Links menu to the right of a DocSum. This linking function is provided by ELink. On the Web, Batch Entrez is used to upload a list of UIDs; this function is provided by EPost. EPost places UIDs on the Entrez History server, which stores the results of previous searches during an Entrez session, as can be done on the Web using the Preview/Index or History tabs. The only E-Utility that does not have a direct Web parallel is EInfo, which provides the vital statistics of Entrez databases, such as the date of the last update, a list of links to other databases, and a list of indexed fields.<br>
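<br>
The correspondence described above can be summarized as a small lookup table. This is only a restatement of the paragraph in Python; the dictionary name and the wording of each entry are illustrative.<br><br>
</span>
<pre class="bodytext2">
```python
# Each E-Utility and the interactive Entrez function it mirrors,
# as described in this article (None means no direct Web parallel).
ENTREZ_EQUIVALENTS = {
    "EGQuery": "Global Query across all Entrez databases",
    "ESearch": "text query in one database, returning matching UIDs",
    "ESummary": "Document Summaries (DocSums) for matching records",
    "EFetch": "full record display",
    "ELink": "Links menu for related records",
    "EPost": "Batch Entrez upload of UIDs to the History server",
    "EInfo": None,
}

# The utilities that have a direct equivalent in interactive Entrez.
with_web_parallel = sorted(
    name for name, web in ENTREZ_EQUIVALENTS.items() if web is not None
)
print(len(ENTREZ_EQUIVALENTS), len(with_web_parallel))
```
</pre>
<span class="bodycopy" width="488">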
</span><span class="bodytext3" width="488"><br>
The E-Utilities Search for Data</span><span class="bodycopy" width="488"><br><br>
Suppose that a researcher wants to find all human RefSeq protein records that have links to Online Mendelian Inheritance in Man (OMIM), and thereby have an associated phenotype. This can be done by posting the ESearch URL shown in Example 1 of the ESearch section of Box 1.
This URL produces XML output, a portion of which is shown below:<br><br>
</span><span class="bodytext2" width="488">&lt;Count&gt;14988&lt;/Count&gt;<br>
&lt;RetMax&gt;20&lt;/RetMax&gt;<br>
&lt;RetStart&gt;0&lt;/RetStart&gt;<br>
&lt;QueryKey&gt;47&lt;/QueryKey&gt;<br>
&lt;WebEnv&gt;0hh9nVItHLfyYJGaMMIh_T0ptRqIsaiaikdx5k_yhaM0S72qC5x-AY&lt;/WebEnv&gt;</span><span class="bodycopy" width="488"><br><br>
Included is the number of records (14,988) matching the query along with the two parameters that define the location of the data set on the History server: the Query Key, with a value of 47, and the Web Environment (WebEnv), with a value of &ldquo;0hh9nVItHLfyYJGaMMI . . . .&rdquo; The latter is a string associated with the internet cookie for the Entrez session.<br><br>
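A calling program would typically pull these three values out of the XML before making further calls. The sketch below does so with the xml.etree.ElementTree module from the Python standard library; the function name is hypothetical, and it assumes that each of the element names shown in the excerpt above occurs somewhere in the response.<br><br>
</span>
<pre class="bodytext2">
```python
import xml.etree.ElementTree as ET


def parse_esearch_history(xml_text):
    """Extract Count, QueryKey and WebEnv from ESearch XML output.

    Hypothetical helper: assumes each element occurs in the response
    and takes the first occurrence found in document order.
    """
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext(".//Count")),
        "query_key": root.findtext(".//QueryKey"),
        "webenv": root.findtext(".//WebEnv"),
    }
```
</pre>
<span class="bodycopy" width="488">The query key and the WebEnv string are kept as text because they are passed back unchanged in later URLs; only the record count is converted to an integer.<br><br>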
</span><span class="bodytext3" width="488">The E-Utilities Retrieve Data</span><span class="bodycopy" width="488"><br><br>
Retrieving the actual records identified in the above search is performed either using ESummary to retrieve DocSums or using EFetch to retrieve formatted records, such as FASTA sequence. Other available sequence formats include GenBank, GenPept, and INSDSeq XML, which can be selected using the &amp;rettype parameter. One consideration to bear in mind is that EFetch is limited to 500 records per URL. Therefore, to retrieve FASTA sequence for all 14,988 records, a loop within the calling program will be required to post 30 URLs, the second of which, to retrieve records 500-999, is shown in the EFetch section of Box 1.<br><br>
The remaining 28 URLs would differ only in the value of the &amp;retstart parameter, which would increment by 500 in each successive call within the loop.<br><br>
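The loop described above might look like the following sketch, which only builds the 30 URLs and does not contact NCBI. The 500-record batch size, the count of 14,988, and the query key of 47 come from this article; the efetch.fcgi script name and the db, query_key, WebEnv, retstart, retmax, and rettype parameter names follow the E-Utilities documentation, and the WebEnv value is a placeholder because the article truncates the real one.<br><br>
</span>
<pre class="bodytext2">
```python
from urllib.parse import urlencode

EFETCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
BATCH_SIZE = 500  # EFetch limit per URL, as stated in this article


def efetch_urls(count, query_key, webenv, rettype="fasta"):
    """Return one EFetch URL per batch of up to BATCH_SIZE records."""
    urls = []
    # retstart takes the values 0, 500, 1000, ... below `count`.
    for retstart in range(0, count, BATCH_SIZE):
        params = {
            "db": "protein",
            "query_key": query_key,
            "WebEnv": webenv,
            "retstart": retstart,
            "retmax": BATCH_SIZE,
            "rettype": rettype,
        }
        urls.append(EFETCH_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    # "PLACEHOLDER_WEBENV" stands in for the session's real WebEnv string.
    urls = efetch_urls(14988, "47", "PLACEHOLDER_WEBENV")
    print(len(urls))
    print(urls[1])
```
</pre>
<span class="bodycopy" width="488">Stepping retstart from 0 in increments of 500 gives 30 offsets for 14,988 records, the last of which (14,500) returns the final 488 records.<br><br>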
</span><span class="bodytext3" width="488">The E-Utilities Limit and Link Datasets</span><span class="bodycopy" width="488"><br><br>
To find the annotated genes associated with a select group of this set of RefSeq proteins, namely those that have interleukin 22 in their title, another ESearch URL can be used, as listed under Example 2 of the ESearch section of Box 1, where &quot;%2347&quot; is the URL encoding for &quot;#47&quot; and refers to our previous query key. The five resulting GIs can be extracted from the XML and used as input to ELink, shown in the ELink section of Box 1.<br><br>
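The encoding of the query key can be checked with the urllib.parse module from the Python standard library. In this sketch the [Title] field tag, the wording of the search term, and the placeholder WebEnv value are assumptions, since the actual URL appears only in Box 1.<br><br>
</span>
<pre class="bodytext2">
```python
from urllib.parse import quote, urlencode

# "#" is a reserved URL character, so "#47" becomes "%2347".
encoded_key = quote("#47")
print(encoded_key)

# Combine the earlier result set with a title restriction.
# The "[Title]" field tag and the phrasing are illustrative assumptions,
# and "PLACEHOLDER_WEBENV" stands in for the session's real WebEnv string.
term = "#47 AND interleukin 22[Title]"
query_string = urlencode(
    {"db": "protein", "term": term, "WebEnv": "PLACEHOLDER_WEBENV", "usehistory": "y"}
)
print(query_string)
```
</pre>
<span class="bodycopy" width="488">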
Because each GI was supplied in a separate &amp;id parameter, the XML output will contain a separate list of linked GeneIDs for each protein GI. A simple analysis of the results reveals that the second, third, and fourth protein GIs are linked to the same GeneID, revealing the three transcriptional variants of the interleukin 22 binding protein, a name in turn retrieved by a single ESummary call with that common GeneID.<br><br>
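The per-protein link lists lend themselves to a simple grouping step. The sketch below first builds an ELink URL with one id parameter per protein GI, then inverts a mapping of protein GI to linked GeneIDs to find GeneIDs shared by several proteins. All identifiers used with it would be placeholders rather than real records, and the elink.fcgi script name and the dbfrom and db parameters follow the E-Utilities documentation rather than this article.<br><br>
</span>
<pre class="bodytext2">
```python
from collections import defaultdict
from urllib.parse import urlencode

ELINK_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(protein_gis):
    """Build an ELink URL with a separate id parameter for each GI."""
    params = [("dbfrom", "protein"), ("db", "gene")]
    params += [("id", gi) for gi in protein_gis]
    return ELINK_URL + "?" + urlencode(params)


def shared_gene_ids(links):
    """Return GeneIDs linked from more than one protein GI.

    `links` maps each protein GI to its list of linked GeneIDs,
    as would be extracted from the ELink XML output.
    """
    proteins_by_gene = defaultdict(list)
    for gi, gene_ids in links.items():
        for gene_id in gene_ids:
            proteins_by_gene[gene_id].append(gi)
    return {g: gis for g, gis in proteins_by_gene.items() if len(gis) > 1}
```
</pre>
<span class="bodycopy" width="488">Given five placeholder GIs in which the second, third, and fourth share one GeneID, shared_gene_ids returns that single GeneID with its three proteins, mirroring the interleukin 22 binding protein result described above.<br><br>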
By using additional combinations of E-Utilities calls, a wide array of data pipelines can be constructed easily and used to process large numbers of data records.<br>
<br>
For more information, see the following:</span></p>
<p class="bodycopy">E-utility online documentation: </p>
<table width="488" border="0" cellspacing="1" cellpadding="0">
<tr>
<td width="488" height="25" align="center" bgcolor="#dfefff">
<div align="center" class="links2"><a href="http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html">eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html</a></div>
</td>
</tr>
</table>
<p class="bodycopy">NCBI PowerScripting, a new NCBI course on programming with the E-utils: </p>
<table width="488" border="0" cellspacing="1" cellpadding="0">
<tr>
<td width="488" height="25" align="center" bgcolor="#dfefff">
<div align="center" class="links2"><a href="http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/course.html">www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/course.html<br>
</a></div>
</td>
</tr>
</table>
<p align="left" class="bodycopy">Building Customized Data Pipelines Using the Entrez Programming Utilities (eutils):</p>
<table width="488" border="0" cellspacing="1" cellpadding="0">
<tr>
<td width="488" height="25" align="center" bgcolor="#dfefff">
<div align="center" class="links2"><a href="http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/chapter_eutils.pdf">www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/chapter_eutils.pdf<br>
</a></div>
</td>
</tr>
</table>
<p align="right"><span class="authors">&mdash;ES</span></p>
</div></td>
<td width="15"></td>
</tr>
<tr valign="top">
<td height="111" valign="left" bordercolor="003399"></td>
<td valign="top" bgcolor="#DFEFFF"><p align="center" class="bodytext3">Helpful E-Utility URLs and E-Utility Samples - The base E-Utility URL:</p> <p align="center"><a href="http://eutils.ncbi.nlm.nih.gov/entrez/eutils" class="links2">eutils.ncbi.nlm.nih.gov/entrez/eutils</a></p> <p align="center"><a href="box1.html" target="_blank">Click here to open Box 1 which contains helpful E-Utility URLs and E-Utility samples</a> </p></td>
<td></td>
</tr>
<tr valign="top">
<td height="154" valign="left" bordercolor="003399"></td>
<td valign="top"><div valign="left">
<p align="right"><a href="pubchem.html"><img height="27" width="69" src="images/continue.gif" border="0" alt="to next article" title="to PubChem"></a>
<div align="right"></div>
<hr noshade size="1" align="right" width="488">
<div align="right"><img src="images/foot1.gif" alt="NCBI News | Summer 2004" width="187" height="32" border="0" usemap="#NCBI News footMap">
<map name="NCBI News footMap">
<area href="http://www.ncbi.nlm.nih.gov/About/newsletter.html" coords="0,8,185,30" shape="rect" alt="NCBI News: Summer 2004" title="NCBI News">
</map>
<br>
<br>
<br>
</div>
<p></p>
</div></td>
<td></td>
</tr>
<tr valign="top">
<td height="48" valign="left" bordercolor="003399"></td>
<td>&nbsp;</td>
<td></td>
</tr>
<tr valign="top">
<td height="52">&nbsp;</td>
<td>&nbsp;</td>
<td valign="left" bordercolor="003399"></td>
<td>&nbsp;</td>
<td></td>
</tr>
</table>
</body>
</html>