<html>
<head>
<title>NCBI Data in XML</title>
</head>
<body>
<h1>NCBI Data in XML</h1>
<h2>Introduction</h2>
<p>
Extensible Markup Language (XML) plays an increasingly important role in
the exchange of a wide variety of data on the Web and elsewhere. In the
early 1990s, NCBI chose a language called Abstract Syntax Notation 1 (ASN.1)
for describing and exchanging information in a manner similar to the way
XML is now used. ASN.1 came out of the telecommunications industry and is a
compact text or binary encoding intended for human-readable text as well as
integers, floating point numbers, and so on. Tools for ASN.1 have largely
stayed within the commercial telecommunications industry, while a host of
public domain tools of varying character have arisen for XML.
</p>
<p>
While ASN.1 remains the primary data specification language at NCBI, our
toolkit also supports XML input and output. An ASN.1 specification can be
rendered into an XML DTD or schema. Data encoded in ASN.1 can be output as
XML that validates against the DTD using standard XML tools. We hope
this will make the structured sequence, map, and structure data, as well
as the output of tools like BLAST, more accessible to those who wish to
work in XML.
</p>
<h2>Data Conversion Details</h2>
<p>
Please note that the conversion of existing ASN.1-specified data into XML
has some limitations.
<br />
ASN.1 has a number of specific data types, such as INTEGER or REAL numbers,
while an XML DTD has only strings. Our DTD therefore automatically adds some
ENTITY definitions at the top that map these numeric types to strings, so
that humans reading the DTD can see where numbers are expected. When
converting an ASN.1 specification into an XML schema, by contrast, our tools
map ASN.1 data types directly into the corresponding XML schema types.
<br />
ASN.1 does not require that an element name be unique except within a
structure, similar to C or C++. An XML DTD, however, requires that all names
be unique across the DTD, unless they are attributes, which must come from a
limited repertoire. Many XML parsers rely on this uniqueness, associating
callback functions with a tag rather than with a tag in context. As a
trivial illustration, if both people and genes have names, they are
distinct in ASN.1:
<pre>
Person ::= SEQUENCE {
name VisibleString,
room-number INTEGER }
Gene ::= SEQUENCE {
name VisibleString,
map VisibleString }
</pre>
but must be made unique in XML to be distinguished. To do so, we prefix
all element names with the name of the context structure:
<pre>
&lt;!ELEMENT Person (Person_name, Person_room-number)&gt;
&lt;!ELEMENT Person_name (#PCDATA)&gt;
&lt;!ELEMENT Person_room-number (#PCDATA)&gt;
&lt;!ELEMENT Gene (Gene_name, Gene_map)&gt;
&lt;!ELEMENT Gene_name (#PCDATA)&gt;
&lt;!ELEMENT Gene_map (#PCDATA)&gt;
</pre>
While this is the default behavior, our tools allow such prefixes to be
omitted if needed, for example when an XML DTD was the original specification.
</p>
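<p>
The prefixing rule described above can be sketched in a few lines of Python.
This is only an illustration, not the NCBI toolkit's implementation: the
<code>SPEC</code> dictionary and the functions <code>dtd_lines</code> and
<code>duplicates</code> are hypothetical names introduced here, and every
field is declared as plain <code>#PCDATA</code> for simplicity.
</p>

```python
from collections import Counter

# Hypothetical in-memory form of the two ASN.1 SEQUENCE types shown above:
# structure name -> ordered list of field names (types are ignored here).
SPEC = {
    "Person": ["name", "room-number"],
    "Gene": ["name", "map"],
}


def dtd_lines(spec, prefix=True):
    """Return DTD element declarations for every structure in spec.

    With prefix=True each field name is qualified with the name of its
    containing structure; with prefix=False the bare field name is used.
    """
    lines = []
    for struct, fields in spec.items():
        names = [f"{struct}_{field}" if prefix else field for field in fields]
        lines.append(f"<!ELEMENT {struct} ({', '.join(names)})>")
        for name in names:
            lines.append(f"<!ELEMENT {name} (#PCDATA)>")
    return lines


def duplicates(lines):
    """Return the sorted element names that are declared more than once."""
    counts = Counter(line.split()[1] for line in lines)
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    for line in dtd_lines(SPEC):
        print(line)
    # Without prefixes, "name" would be declared by both Person and Gene.
    print("clashes without prefixes:", duplicates(dtd_lines(SPEC, prefix=False)))
```

<p>
Calling <code>dtd_lines(SPEC, prefix=False)</code> shows why the prefix is the
default: the bare name <code>name</code> would be declared twice, once for
each structure.
</p>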
<hr />
<p>
Please email questions to:
<a href="mailto:info@ncbi.nlm.nih.gov">info@ncbi.nlm.nih.gov</a>
</p>
<p>
Last updated: Aug 26, 2005
</p>
</body>
</html>