Week 2: Creating XML Documents
Reading Assignments
BOOK | PAGES
XML in easy steps | CH 2
LINKS TO XML RELATED SITES
- XML.COM
- MSDN'S XML DEVELOPER SITE
- IBM'S XML WEBSITE
- IBM'S ALPHAWORKS WEBSITE
- W3 Schools XML Tutorial
- Kickstart XML Tutorial
YOUR FIRST XML DOCUMENT
In our first week coding XML you
will build a simple address book, with 5 or so records, and use a total of 20
elements. It will follow the rules of XML syntax to be well formed, and with
our DTD next week, be valid as well. The file
is already built, so I'm hoping that you will modify this one, or embark on
your own theme.
- Your first xml file: The first thing that we need to do is create an XML
file, save it, and then open it in IE5.5 (or IE 6). Use any favorite text
editor (BBEdit or Notepad) to type your XML document. In our file we will type
<name> </name> and then save the file. The file extension of an XML document
needs to be ".xml". After saving the file, go ahead and open it with IE5 (or
IE 6), as you would open an HTML file. MSIE shows that it understands XML by
displaying the tags in color. Why does it display "<name />"? We will talk
about that later.
- Adding more: You just saw how simple it was to create and view an XML
document without having really talked about any of the specifics yet. In this
file, we have made a more elaborate XML document. We've actually started to
build out the address_book.xml file. Later on, we will explain some of these
concepts.
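The steps above (type the document, save it with an ".xml" extension, open it
with something that parses XML) can also be tried outside the browser. This is
only a minimal sketch: it assumes Python's standard xml.etree.ElementTree parser
as a stand-in for IE, and the file name first.xml and the temporary folder are
made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# The smallest document from this section: a single <name> element.
first_document = "<name> </name>"

# Save it with the ".xml" extension (the file name is hypothetical).
path = os.path.join(tempfile.mkdtemp(), "first.xml")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(first_document)

# Open it with a parser, much as the browser does when it loads the file.
root = ET.parse(path).getroot()
print(root.tag)
```

If the file were not well formed, ET.parse would raise an error instead of
returning the root element.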
COMMENTING
- Comments: One of the first concepts a person should learn about a language
is the commenting syntax it uses. XML comments should be familiar to you, as
the syntax is almost the same as the HTML syntax for commenting. (Actually
these are SGML comments, as you will read later in your book.)
- Except... You can not use "--" inside the comments. If you
like to have long lines of comments as easy visual markers, such as
"<!-- -----------SOME INFO--------- -->", you can not use "--"; use "=="
instead: "<!-- ============SOME INFO=========== -->"
- Some possible alternatives: Use any of these commenting lines.
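The "--" rule is easy to check with a parser. The sketch below assumes Python's
standard xml.etree.ElementTree parser rather than IE; the helper name
is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it raises an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A visual marker built from dashes puts "--" inside the comment.
dashed = "<address_book><!-- ------ SOME INFO ------ --></address_book>"
# The same marker built from equals signs avoids the problem.
equals = "<address_book><!-- ====== SOME INFO ====== --></address_book>"

print(is_well_formed(dashed))
print(is_well_formed(equals))
```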
WELL-FORMED XML DOCUMENT
- Root Element: For a document to be well-formed it must follow the XML
recommendation. Basically, a document must have a root element. As an example,
"<HTML>" is the root element in an HTML document; "<address_book>" is our root
element. The next guideline is the order in which to open and close.
- One and only one... root element is allowed. A document can not contain two
or more root elements. In this example there are two instances of
"<address_book></address_book>". You can take a look at the source code of this
document if you need to see the code itself. Your browser should return an
error.
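Here is a rough way to see the same error without the browser. It is a sketch
that assumes Python's standard xml.etree.ElementTree parser; the <record/>
element is just filler for illustration.

```python
import xml.etree.ElementTree as ET

one_root = "<address_book><record/></address_book>"
two_roots = "<address_book></address_book><address_book></address_book>"

# A single root element parses without complaint.
ET.fromstring(one_root)

# A second root element is rejected by the parser.
try:
    ET.fromstring(two_roots)
    error = None
except ET.ParseError as exc:
    error = exc

print(error)
```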
- FOLE: First Open Last End. This is a play on the programming term FILO
(First In Last Out). Another guideline which an XML document needs to meet is
the order of the end elements, FOLE. Basically, you are required to preserve
the symmetry of element nesting. The first opening element is the last closing
element. Or close last what you opened first, and close first what you opened
last. Whatever. The best example is the root element; it is the first element
to open and the last element to end.
- Do not try this at home: In this simple example, the order of "</names>" and
"</name>" was switched. This would cause any XML parser to choke. Actually,
you will get a polite error statement about a mismatch, including the line
and position number of the first "offense". For those of you using
Dreamweaver to code your XML (never too early), the yellow tag error marker
will also highlight the mismatch.
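The "polite error statement" can be reproduced with most parsers. This sketch
assumes Python's standard xml.etree.ElementTree parser, whose ParseError carries
a (line, column) position; the sample name "Ada" is made up.

```python
import xml.etree.ElementTree as ET

# The end tags </names> and </name> are in the wrong order on line 3.
switched = """<address_book>
  <names>
    <name>Ada</names>
  </name>
</address_book>"""

try:
    ET.fromstring(switched)
    position = None
except ET.ParseError as exc:
    # position is a (line, column) pair for the first offense.
    position = exc.position
    print(exc)
```

The parser stops at the first mismatch; it does not go on to report the second
misplaced end tag.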
- alpha under case-no "xml": An XML tag name must start with a letter or an
underscore (_), and tag names are case-sensitive. A tag name can not begin
with the character combination "xml", in any mix of upper and lower case; it's
a reserved name (we'll talk about reserved names later). By the way....
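As a memory aid, the naming rule can be written as a small check. This is a
simplified sketch, not the full rule from the XML recommendation: it only looks
at plain ASCII names, and the function name is_simple_xml_name is made up for
this example.

```python
import re

# Simplified assumption: an ASCII letter or underscore first, then letters,
# digits, underscores, hyphens, or periods.
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")

def is_simple_xml_name(name):
    """Return True if name passes the simplified naming rule above."""
    if not _SIMPLE_NAME.match(name):
        return False
    # Names starting with "xml" (in any letter case) are reserved.
    return not name.lower().startswith("xml")

for candidate in ["address_book", "_note", "1st_record", "xmlRecord", "first name"]:
    print(candidate, is_simple_xml_name(candidate))
```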
- Case matters!
Unlike HTML, XML is fully case-sensitive. In this example the closing element
for "<record>", "</record>" was replaced by "</Record>".
To your computer, ASCII characters are all that matter. An upper case "A"
is no more similar to a lower case "a" than it is to "z".
Try to stay lower case as a good habit.
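A quick sketch of what case sensitivity means to a parser, assuming Python's
standard xml.etree.ElementTree: "record" and "Record" are two unrelated element
names, and one can not close the other.

```python
import xml.etree.ElementTree as ET

# Two different elements as far as the parser is concerned.
book = ET.fromstring("<address_book><record/><Record/></address_book>")
lower_count = len(book.findall("record"))
upper_count = len(book.findall("Record"))
print(lower_count, upper_count)

# Closing <record> with </Record> is a mismatch, not a match.
try:
    ET.fromstring("<record></Record>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False
print(mismatch_accepted)
```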
ANATOMY OF AN XML DOCUMENT
- Three Segmental Structure: An XML document can be broken up into three
sections: Prolog, Root Element, and Epilog. A well-formed and valid XML
document (version 1) will have a "prolog" and a root element.
- Processing Instruction: The <? xxxxx ?> is a special syntax referred to as a
processing instruction. The most typical example is the XML declaration, which
indicates to the parser the version of XML this document adheres to
(<?xml version="1.0"?>). Processing instructions are information that the
document provides to the XML parser. What we have been following is the
specification of XML version 1.0. In the future there might be additional
changes to the XML specification that vary greatly from the 1.0
recommendation. For upward compatibility it's a great idea to include the
version of XML that your document adheres to. In the future we will see
different uses for the processing instruction. Notice that the comment
containing the "PROLOG" is no longer there. That's because the XML declaration
must be the very first thing in the document that the IE5 parser sees. The
epilog can contain more comments and other processing instructions.
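The placement rule for the XML declaration can be checked with a parser. This
sketch assumes Python's standard xml.etree.ElementTree in place of the IE5
parser; the helper name is_well_formed and the comment text are made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it raises an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Declaration first; comments are fine later in the prolog and in the epilog.
declaration_first = (
    '<?xml version="1.0"?>\n'
    "<!-- PROLOG: a comment after the declaration -->\n"
    "<address_book/>\n"
    "<!-- EPILOG: a comment after the root element -->\n"
)

# A comment placed before the declaration.
comment_first = (
    "<!-- PROLOG -->\n"
    '<?xml version="1.0"?>\n'
    "<address_book/>\n"
)

print(is_well_formed(declaration_first))
print(is_well_formed(comment_first))
```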
MORE ON XML
- Attribute: Like HTML, the elements of an XML document can
support attributes. Attributes extend an element's capacity of structuring
a document by packing additional information about that element.
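A small sketch of an element carrying an attribute, read back with Python's
standard xml.etree.ElementTree. The attribute name "type" and the sample values
are assumptions for illustration, not taken from address_book.xml.

```python
import xml.etree.ElementTree as ET

# The attribute adds information about the <record> element itself.
record = ET.fromstring('<record type="friend"><name>Ada</name></record>')

print(record.get("type"))        # the attribute value
print(record.find("name").text)  # the nested element's text
```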
- Entity References: An entity is an instruction that the XML parser replaces
with other text while parsing the document. Entities are not new at all. The
page you are viewing is using the entity "&lt;" to display the "<". This is
because HTML tags, and XML tags as well, are indicated by being encompassed
within "< >" characters. Therefore, it would be impossible to display these
characters on the screen as typed, because the parser would treat them as the
start of a tag. The XML language has 5 built-in entity references: (&lt; <),
(&gt; >), (&amp; &), (&apos; '), and (&quot; "). These entity references
are derived from SGML, hence their appearance in HTML.
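The five built-in entity references can be seen at work with a short sketch,
assuming Python's standard xml.etree.ElementTree parser; the <note> element is
made up for this example.

```python
import xml.etree.ElementTree as ET

# All five built-in entity references inside one element.
note = ET.fromstring(
    "<note>&lt;name&gt; &amp; &apos;single&apos; &quot;double&quot;</note>"
)

# The parser hands back the characters the entities stand for.
print(note.text)

# A bare "<" in the text is taken as the start of a tag and rejected.
try:
    ET.fromstring("<note>1 < 2</note>")
    bare_accepted = True
except ET.ParseError:
    bare_accepted = False
print(bare_accepted)
```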
- CDATA? PCDATA? So what type of stuff can be placed within an element? Here
we come across this CDATA PCDATA stuff. For starters, any text placed within
an element is by default of type Parsed Character DATA (PCDATA). This means
the data will be parsed by the XML parser. In contrast to PCDATA is plain old
Character DATA (CDATA), data that is not parsed by the parser. As you can
remember, in our Entity example we had to use the "&lt;" characters to encode
the text HTML so that the parser would replace them with "<". But CDATA does
not get parsed, so there is no reason to use any entities. If you are very
detailed you should also notice that there is some white space between the
HTML and BODY in choice "c". The white space in CDATA is preserved, since the
parser never parses this data, and therefore the white space is not converted
to a single white space as it normally would be.
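The difference between escaping with entities and wrapping text in a CDATA
section can be sketched as follows. This assumes the standard
"<![CDATA[ ... ]]>" syntax and Python's xml.etree.ElementTree parser; the <code>
element name is made up.

```python
import xml.etree.ElementTree as ET

# Parsed character data: every "<" and ">" has to be written as an entity.
with_entities = ET.fromstring("<code>&lt;HTML&gt;   &lt;BODY&gt;</code>")

# A CDATA section: the same text typed as-is, white space included.
with_cdata = ET.fromstring("<code><![CDATA[<HTML>   <BODY>]]></code>")

print(repr(with_entities.text))
print(repr(with_cdata.text))

# Inside CDATA an entity reference is NOT replaced; it stays literal text.
literal = ET.fromstring("<code><![CDATA[&lt;]]></code>")
print(literal.text)
```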
PUTTING IT TOGETHER
- Empty Elements: If you have had any HTML experience (of course you have) you
have seen cases in which a closing tag is not really needed. One of the best
cases is the image tag, <IMG src="some url" >. The paragraph tag <P> and the
break tag <BR> are other examples where HTML lets you leave out the closing
tag. Since XML requires every element to have an ending element, a special
syntax was devised to handle "empty tags". The syntax of an empty tag is
"<IMG src="some url" />" or "<br />". When I send email, I often use <snip> to
divide sections for meaning. Technically I should use <snip />. Sorry,
oops, <sorry />.
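To a parser the empty-tag shorthand and an open/close pair with nothing between
them are the same element. A minimal sketch, assuming Python's standard
xml.etree.ElementTree (lower-case img is used here as a matter of habit):

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<img src="some url" />')
long_form = ET.fromstring('<img src="some url"></img>')

# Same tag, same attributes, and no text content in either form.
same = (short_form.tag, short_form.attrib, short_form.text) == (
    long_form.tag,
    long_form.attrib,
    long_form.text,
)
print(same)

# Writing the long form back out gives the empty-tag shorthand.
serialized = ET.tostring(long_form, encoding="unicode")
print(serialized)

# The HTML habit of leaving a tag open is not accepted.
try:
    ET.fromstring("<p>line one<br>line two</p>")
    html_style_accepted = True
except ET.ParseError:
    html_style_accepted = False
print(html_style_accepted)
```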
- To Nest or Not to Nest: One of the main questions that everyone has to go
through as they design an XML document is whether to use nesting or to use an
attribute. If you need a hierarchical data structure you should nest
(non-empty elements); otherwise attributes are a bit more efficient to parse.
Some types of information lend themselves better to being organized in a
hierarchical order rather than a sequential one. However, the W3C has said
that performance should not be a criterion in the design of XML. Nesting adds
organization, but be wary of overnesting (too many levels). In general, most
'ontologies' will have three levels, so for an XML data structure, a data
model with three levels is (usually) just fine.
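The same piece of data can be modeled either way. The sketch below is only an
illustration: the element and attribute names (name, first, last) and the
sample values are assumptions, not the layout of address_book.xml, and Python's
standard xml.etree.ElementTree is used to read both versions.

```python
import xml.etree.ElementTree as ET

# Nested: three levels (record, name, first/last) of non-empty elements.
nested = ET.fromstring(
    "<record><name><first>Ada</first><last>Lovelace</last></name></record>"
)

# Attributes: one empty element, no hierarchy.
flat = ET.fromstring('<record first="Ada" last="Lovelace" />')

nested_first = nested.find("name/first").text
flat_first = flat.get("first")
print(nested_first, flat_first)
```

The nested form leaves room to grow (a middle name, a title) under <name>; the
attribute form is more compact when no further structure is needed.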
- Mixed elements: Think of empty and nested as hot and cold water. Most of us
use 'warm' water, which we produce by mixing some hot and some cold. Plumbing
only comes in hot and cold, so that's where we start. The *majority* of human
designed XML code is in a nested format, and the *majority* of machine
generated code is empty. The code I use (in biological applications) is mostly
empty with some nested, and a fair amount of 'mixed' character. Take a close
look at the six example files below; they will help you understand each style.
- Ordinal counting: Now this is where it gets a little tricky. Before you
read any further, make sure that you have looked at the six files (above).
Some data models have an element of 'counting' to them. For instance, in a
recipe, you might have ingredients and steps. What you want to avoid is
having numbers or any notion of counting in an element name. Elements like
<ingredient_1>, <ingredient_2>, <ingredient_3>, and <step_1>, <step_2>,
and <step_3> are such an example. If you find yourself 'counting' in
your element names, in either the nested or the empty model, you probably
should be using attributes, or 'mixed' character in your nested and empty
models. Take a look at recipe_counting.xml. That's the *wrong* way to do it.
Now take a look at recipe_attributes.xml; that's the *right* way to do it.
If you have a model where there is a notion of order, try looking at these
files: recipe_attributes_nested.xml and recipe_attributes_empty.xml.
Lastly, if you *really* want to explore a complicated model, take a look at
the knitting_patterns folder.
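To see why counting in element names hurts, compare how each model is read by
a program. This is a hypothetical sketch: the two small recipes below are made
up for illustration and are not the contents of recipe_counting.xml or
recipe_attributes.xml, and the attribute name "order" is an assumption.

```python
import xml.etree.ElementTree as ET

counting = ET.fromstring(
    "<recipe>"
    "<ingredient_1>flour</ingredient_1>"
    "<ingredient_2>sugar</ingredient_2>"
    "<ingredient_3>butter</ingredient_3>"
    "</recipe>"
)
with_attributes = ET.fromstring(
    "<recipe>"
    '<ingredient order="1">flour</ingredient>'
    '<ingredient order="2">sugar</ingredient>'
    '<ingredient order="3">butter</ingredient>'
    "</recipe>"
)

# Counting in names: the reader must know every element name in advance.
counted = [counting.find(f"ingredient_{n}").text for n in (1, 2, 3)]

# Attributes: one element name covers any number of ingredients,
# and the order is ordinary data that can be sorted on.
ordered = sorted(
    with_attributes.findall("ingredient"),
    key=lambda element: int(element.get("order")),
)
listed = [element.text for element in ordered]

print(counted)
print(listed)
```

Adding a fourth ingredient to the first recipe means inventing a new element
name; in the second it is just one more <ingredient>.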
EXAMPLES
Students may also want to look
at address_book_mixed.xml and download
the entire Address_Book.zip archive.
HOMEWORK
Copyright © 2006 - 2007 Robert D. Cormia -
September 17, 2006