By Jennifer Kyrnin
The latest specification of the Extensible Markup Language is available online
at the W3C. This specification
completely describes XML, but it can be fairly difficult to understand. In this
article, we will examine several parts of the XML specification in order
to understand the basics of an XML document.
Definitions
- characters
- a character is one unit of text, such as a letter, numeral, space, tab, or
any other Unicode character
- DTD
- Document Type Definition, the actual grammar of the XML document
- Document Type Declaration, the statement at the top of valid XML documents defining where to find the Document Type Definition
- entity
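- a storage unit for the XML document. Each XML document consists of one
or more entities. For example, the file that holds the document itself is the
document entity, and an external DTD file is another entity. Entities should
not be confused with elements such as <html></html>.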
- XML
- Extensible Markup Language
- XML document
- a document that is well-formed as described in the XML specification
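To make the idea of an entity a little more concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The document, the element name "note", and the entity name "greet" are all made up for illustration. The entity is declared inside the document type declaration, and the parser substitutes its replacement text wherever &greet; appears.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset declares an entity named
# "greet", and the element content refers to it as &greet;.
doc = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY greet "Hello World!">
]>
<note>&greet;</note>"""

root = ET.fromstring(doc)

print(root.tag)   # name of the root element
print(root.text)  # the entity reference has been replaced by its text
```

Note that "note" is an element while "greet" is an entity: the element is part of the document's structure, and the entity is a named piece of content stored elsewhere.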
XML Documents
As mentioned in the definitions, an XML document is composed of entities and is
well-formed if it conforms to the rules in the XML specification. Here are
some basic aspects of an XML document:
- white space
Unlike HTML, XML does not collapse white space (spaces, tabs, carriage returns,
line feeds). The parser passes every white space character in the content to the
application, which decides whether it is significant.
- markup characters
XML uses the same characters as HTML to indicate markup: < and > delimit tags,
and & begins an entity or character reference. The colon (:) is allowed in XML
names, but it is reserved for namespaces.
- other characters
Other ASCII and Unicode characters are taken literally as character data.
- comments
XML also uses the same comment style you are familiar with in HTML:
<!-- -->
- processing instructions
These are special tags created to contain instructions for applications. They
begin with <? followed by a target name, and end with ?>.
- CDATA
When you have a block of text that contains characters such as < and &, and you
need it treated as data rather than markup, you can begin a CDATA section with
<![CDATA[ and end it with ]]>. Everything in between is passed to the
application as literal text. It is not discarded.
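Two of these points, white space and CDATA, are easy to check with a parser. The following is a small sketch in Python using the standard library's xml.etree.ElementTree. The element names "note", "a", and "b" are invented for the example.

```python
import xml.etree.ElementTree as ET

# Text with a run of five spaces between the words.
spaced_text = "Hello" + " " * 5 + "World"

# Hypothetical document: <a> holds the spaced text, <b> holds a CDATA section
# whose content looks like markup.
doc = (
    "<note>"
    "<a>" + spaced_text + "</a>"
    "<b><![CDATA[<i>1 < 2 & 3</i>]]></b>"
    "</note>"
)
root = ET.fromstring(doc)

spaced = root.find("a").text    # all five spaces survive parsing
literal = root.find("b").text   # CDATA content arrives as plain text
children_of_b = len(root.find("b"))  # <i> was never parsed as an element

print(repr(spaced))
print(repr(literal))
print(children_of_b)
```

The parser hands back the spaces untouched, and the text inside the CDATA section comes through as characters rather than as an <i> element.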
When you start an XML document, you should begin with an XML declaration that
indicates the version of XML used in the document. To write a valid XML document, you
must also have a document type declaration before the first element
in the document, identifying the document type definition (DTD) that the
document conforms to. Here is a sample XML document (that would be validated in a
validating parser):
<?xml version="1.0"?>
<!DOCTYPE firstxml SYSTEM "first.dtd">
<firstxml>
<greeting>Hello World!</greeting>
</firstxml>
The first line <?xml version="1.0"?> defines the version
of XML being used. If your XML document does not conform to the version specified
then your document has an error.
Line 2, <!DOCTYPE firstxml SYSTEM "first.dtd">, is the document type
declaration. The name "firstxml" is the name of the root element of the XML
document, and "first.dtd" is the URI of the DTD. Because it is a relative URI,
the DTD is found in the same directory as the XML document.
The third line of the document, <firstxml>, opens the
root element of the document. Its name must match the name given in the
document type declaration.
The fourth line of the document, <greeting>Hello
World!</greeting>, is the actual content of the document. The <greeting>
element is declared in the DTD, "first.dtd".
Finally, the last line in the document is the closing tag of the root element,
</firstxml>
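If you want to see how a parser reads this sample, here is a short sketch in Python using the standard library's xml.etree.ElementTree. Keep in mind that this is a non-validating parser, so it checks only that the document is well-formed and never reads "first.dtd". The variable names are my own.

```python
import xml.etree.ElementTree as ET

# The sample document from this article.
sample = """<?xml version="1.0"?>
<!DOCTYPE firstxml SYSTEM "first.dtd">
<firstxml>
<greeting>Hello World!</greeting>
</firstxml>"""

root = ET.fromstring(sample)
greeting = root.find("greeting")
print(root.tag)       # firstxml
print(greeting.text)  # Hello World!

# Removing the closing tag of the root element breaks well-formedness,
# and the parser reports an error instead of returning a tree.
broken = sample.replace("</firstxml>", "")
try:
    ET.fromstring(broken)
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)    # False
```

Checking the document against the DTD itself requires a validating parser, which the Python standard library does not include.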
Confused Yet?
Because XML is so generic, it can get very confusing very quickly. Here are
some basic points to remember when starting XML:
- Each XML document should start with the version of XML you are using
<?xml version="1.0"?>
- The second line of your document should be the document type declaration,
which includes the name of your root element and the URI or location of your DTD
<!DOCTYPE mydocument SYSTEM "mydtd.dtd">
Note: if you do not need your document to be validated against a DTD, you may
omit this line. You would use different notation if you were going to validate
against an XML Schema document.
- The tags in an XML document are delimited by < and >. XML is
case sensitive (i.e., <greeting> and <GREETING> and <Greeting> are
three different elements), and I recommend that it be written in lower
case.
- Standard comments are just like HTML, <!-- -->. A comment can enclose XML
tags, but it cannot contain the string "--", so comments cannot be nested.
- To include text that contains < or & as data, without having it parsed as
markup, use a CDATA section:
<mydocument>
<greeting>Hello World</greeting>
<![CDATA[
This information is data for the XML document. It is passed to the application
as text and is not parsed as markup.
<cdata_tag>even this tag is treated as text</cdata_tag>
But the tag following the next line is again part of the XML markup
]]>
<closing>Goodbye World!</closing>
</mydocument>
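Case sensitivity and comments can both be tried out quickly. This is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The helper function is_well_formed and the wrapper element "doc" are my own names, not part of any library or specification.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it raises an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The closing tag must match the opening tag exactly, including case.
matched = is_well_formed("<greeting>Hello World</greeting>")
mismatched = is_well_formed("<greeting>Hello World</GREETING>")
print(matched, mismatched)

# A comment hides the <closing> element from the parsed tree.
doc = (
    "<doc>"
    "<greeting>Hello World</greeting>"
    "<!-- <closing>Goodbye World!</closing> -->"
    "</doc>"
)
root = ET.fromstring(doc)
names = [child.tag for child in root]
print(names)
```

With its default settings this parser drops comments from the tree, so only the greeting element remains as a child of the root.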
Once you understand the basic aspects of an XML document, you're ready to
start creating your own.