<html>
<head>
<title>NCBI Data in XML</title>
</head>
<body>

<h1>NCBI Data in XML</h1>
<h2>Introduction</h2>

<p>
Extensible Markup Language (XML) plays an increasingly important role in
the exchange of a wide variety of data on the Web and elsewhere. In the early
1990s, NCBI chose a language called Abstract Syntax Notation One (ASN.1) for
describing and exchanging information in a manner similar to the way XML
is now used. ASN.1 came out of the telecommunications industry and is a
compact text or binary encoding intended for human-readable text as
well as integers, floating point numbers, and so on. Tools for ASN.1 have
largely stayed within the commercial telecommunications industry, while a
host of public domain tools of varying character have arisen for XML.
</p>
<p>
While ASN.1 remains the primary data specification language at NCBI, our
toolkit also supports XML input and output. An ASN.1 specification can be
rendered into an XML DTD or schema. Data encoded in ASN.1 can be output as
XML that validates against the DTD using standard XML tools. We hope
this will make the structured sequence, map, and structure data, as well
as the output of tools like BLAST, more accessible to those who wish to
work in XML.
</p>

<h2>Data Conversion Details</h2>
<p>
Please note that the conversion of existing ASN.1-specified data into XML
has some limitations.
<br />
ASN.1 has a number of specific data types, such as INTEGER and REAL,
while an XML DTD has only strings, so our DTD automatically adds some ENTITY
definitions at the top that map these types to strings. This lets
people who read the DTD see where numbers are expected. By contrast,
when converting an ASN.1 specification into an XML schema, our tools
map ASN.1 data types to the corresponding XML schema types.
ASN.1 does not require that an element name be unique except within a
structure, similar to C or C++. An XML DTD, however, requires that all names be
unique across the DTD, unless they are attributes, which must come from a
limited repertoire. Many XML parsers rely on this so that callback
functions are associated with a tag, not a tag within context. As a
trivial illustration, if both people and genes have names, they are
distinct in ASN.1:

<pre>
Person ::= SEQUENCE {
    name VisibleString,
    room-number INTEGER }

Gene ::= SEQUENCE {
    name VisibleString,
    map VisibleString }
</pre>

but must be made unique in XML to be distinguished. To do so, we prefix
all element names with the name of the context structure:

<pre>
&lt;!ELEMENT Person (Person_name, Person_room-number)&gt;
&lt;!ELEMENT Person_name (#PCDATA)&gt;
&lt;!ELEMENT Person_room-number (#PCDATA)&gt;

&lt;!ELEMENT Gene (Gene_name, Gene_map)&gt;
&lt;!ELEMENT Gene_name (#PCDATA)&gt;
&lt;!ELEMENT Gene_map (#PCDATA)&gt;
</pre>
While this is the default behavior, our tools allow these prefixes to be omitted
if needed, for example when an XML DTD was the original specification.
</p>
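<p>
The prefixing rule described above can be sketched in a few lines of Python. This is a
minimal illustration under stated assumptions, not the NCBI toolkit's implementation:
the helper <code>to_xml</code>, the <code>handlers</code> table, and the sample values are
hypothetical, and only the standard library module <code>xml.etree.ElementTree</code> is used.
It shows how prefixing makes every tag unambiguous, so that a callback can be looked up
by tag name alone.
</p>
<pre>
```python
import xml.etree.ElementTree as ET


def to_xml(struct_name, fields):
    """Build an element whose children are prefixed with the structure name.

    Hypothetical helper for illustration; not part of the NCBI toolkit.
    """
    root = ET.Element(struct_name)
    for field, value in fields.items():
        child = ET.SubElement(root, f"{struct_name}_{field}")
        child.text = str(value)
    return root


# Sample values are made up for the example.
person = to_xml("Person", {"name": "Jane Doe", "room-number": 42})
gene = to_xml("Gene", {"name": "GENE1", "map": "1p36"})

# Because every tag is globally unique, callbacks can be keyed by tag alone.
handlers = {
    "Person_name": lambda text: f"person: {text}",
    "Gene_name": lambda text: f"gene: {text}",
}

results = [
    handlers[child.tag](child.text)
    for root in (person, gene)
    for child in root
    if child.tag in handlers
]

print(ET.tostring(person, encoding="unicode"))
print(results)
```
</pre>
<p>
Without the prefix, both structures would emit a plain name element, and the lookup
would have to track the parent element as well as the tag.
</p>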

<hr />
<p>
Please email questions to:
<a href="mailto:info@ncbi.nlm.nih.gov">info@ncbi.nlm.nih.gov</a>
</p>
<p>
Last updated: Aug 26, 2005
</p>
</body>
</html>