Exploring Entrez Direct: Parsing the XML Output of E-utilities

Entrez Direct is a UNIX/LINUX command-line interface to E-utilities, the API to the NCBI Entrez system. One of Entrez Direct’s most useful features is its ability to parse and reformat the complex XML data returned by EFetch. In this post, we will explore how to use these features to parse, reformat and process specific data from PubMed records downloaded in XML using EFetch. Though this post focuses on PubMed, the technique is universal and applies to any XML returned by E-utilities from any database. The example explored here is also presented briefly in the Entrez Direct documentation; here we’ll dive in a bit deeper to see how it works. Let’s get started!

The goal is to identify the authors who have published the most papers in PubMed on a particular topic; in this case, the topic will be the phospholipase from rattlesnakes. The output should be a list of authors sorted by the number of publications.

Here’s the complete LINUX shell script to accomplish the task:

esearch -db pubmed -query \
"crotalid venoms [MAJR] AND phospholipase [TIAB]" | \
efetch -format xml | \
xtract -pattern PubmedArticle \
-block Author -sep " " -tab "\n" -element LastName,Initials | \
sort-uniq-count-rank

So how does this work? The first three lines are standard E-utility functions that search PubMed with the query as shown, then download the resulting records in XML. At this point, however, the desired data are still buried in the XML and need to be parsed out and reformatted to be useful.

A tool to accomplish this task is xtract, a utility unique to Entrez Direct that can do the data wrangling for us.
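Because the script chains several Entrez Direct programs together, it can help to confirm that they are all on your PATH before running it. The following is a minimal sketch, not part of Entrez Direct; the function name check_tools is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper (not part of Entrez Direct): report any named
# command that cannot be found on the PATH.
# Returns 0 if every command is present, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# The four programs used by the script in this post.
check_tools esearch efetch xtract sort-uniq-count-rank \
  || echo "Install Entrez Direct or add it to your PATH before running the script." >&2
```

If nothing is printed, all four programs were found and the script should be able to start.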

First, we have to examine the XML returned by EFetch to find the data elements that we want. Here’s a piece of a PubMed record in XML showing the list of authors:

<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Rodrigues</LastName>
<ForeName>Mariana A P</ForeName>
<Initials>MA</Initials>
<AffiliationInfo>
<Affiliation>Departamento de Farmacologia, Faculdade de Ciências Médicas, Universidade Estadual de Campinas (UNICAMP), Rua Tessália Vieira de Camargo, 126, Cidade Universitária Zeferino Vaz, 13083-887, Campinas, SP, Brazil.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Dias</LastName>
<ForeName>Lourdes</ForeName>
<Initials>L</Initials>
<AffiliationInfo>
<Affiliation>Departamento de Farmacologia, Faculdade de Ciências Médicas, Universidade Estadual de Campinas (UNICAMP), Rua Tessália Vieira de Camargo, 126, Cidade Universitária Zeferino Vaz, 13083-887, Campinas, SP, Brazil.</Affiliation>
</AffiliationInfo>
</Author>
...
</AuthorList>

The data we want for each author are the <LastName>, <ForeName>, and <Initials> tags contained in the <Author> block. (For this example, we’ll only use the <LastName> and <Initials> data.) These data, in turn, are contained within the global <PubmedArticle> block that contains all data for a single PubMed record.

Now we can begin building the xtract command. The -pattern option specifies the block for the entire record, the -block option specifies the block within the record that contains the target data, and the -element option specifies the particular data tags within the block. The simple xtract command would be as follows:

xtract -pattern PubmedArticle -block Author -element LastName,Initials

Each line of this command’s output will contain a tab-delimited list of LastName and Initials data for the authors from a single PubMed record. To rank the authors across all returned records, however, we need a single list with one author per line. This is what the -tab "\n" option does: it replaces the default separator (a tab) between each group of values returned by -element with a newline character (\n). Now each LastName and Initials pair is written on a separate line. To make the data more readable, we insert a space between the LastName and Initials with the -sep " " option, which replaces the default separator (also a tab) between the comma-separated data fields specified in the -element option.
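To make the reshaping concrete without querying NCBI, the sketch below imitates it with awk on a made-up line of default output built from the two authors in the XML sample above. It only illustrates the before and after shapes; xtract does this work itself through -sep and -tab, and the function name pairs_per_line is a hypothetical helper, not an Entrez Direct command.

```shell
# Simulated default output for one record: every value separated by a tab.
# (Illustrative input taken from the XML sample, not real xtract output.)
default_line=$(printf 'Rodrigues\tMA\tDias\tL')

# Hypothetical helper: take consecutive pairs of tab-separated fields
# (LastName, Initials), join each pair with a space, and print one pair
# per line. A trailing unpaired field, if any, is ignored.
pairs_per_line() {
  awk -F '\t' '{ for (i = 1; i < NF; i += 2) print $i " " $(i + 1) }'
}

printf '%s\n' "$default_line" | pairs_per_line
```

The two output lines, "Rodrigues MA" and "Dias L", have the shape that the -sep " " and -tab "\n" options produce for every record, ready to be counted.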

We’re almost there! We now need only count the number of times each author appears in the list and then sort the list by that count. In LINUX, we can do this very easily with the sort and uniq commands. Herein lies one of the great advantages of Entrez Direct: we can simply “pipe” its output to sort and uniq. This is such a common task that Entrez Direct includes a ready-made command to do it in one step: sort-uniq-count-rank.
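For readers curious about what such a counting step involves, here is a rough approximation built only from standard tools. It is a sketch under assumptions, not the actual implementation of sort-uniq-count-rank, whose details (such as case handling and tie-breaking) may differ; the function name rank_lines is hypothetical, and the input lines are illustrative rather than real results.

```shell
# Hypothetical helper approximating a count-and-rank step:
# count identical input lines, then list them from most to least
# frequent as "count<TAB>line".
rank_lines() {
  LC_ALL=C sort \
    | uniq -c \
    | awk '{ n = $1; sub(/^[ \t]*[0-9]+[ \t]+/, ""); print n "\t" $0 }' \
    | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Illustrative input: one author name per line, as produced by xtract above.
printf 'Rodrigues MA\nDias L\nRodrigues MA\n' | rank_lines
```

With this sample input, the first output line is "2", a tab, then "Rodrigues MA", followed by "1", a tab, then "Dias L". Sorting before uniq matters because uniq only collapses adjacent duplicate lines.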

With that, the script is complete. By using the -sep and -tab options of xtract, we can reformat the parsed XML data in a wide variety of ways. We encourage you to explore these options and to comment on this post with your ideas and suggestions!
