Week 2: Creating XML Documents


 

Reading Assignments
BOOK: XML in easy steps
PAGES: CH 2

LINKS TO XML RELATED SITES

  1. XML.COM
  2. MSDN'S XML DEVELOPER SITE
  3. IBM'S XML WEBSITE
  4. IBM'S ALPHAWORKS WEBSITE
  5. W3 Schools XML Tutorial
  6. Kickstart XML Tutorial

YOUR FIRST XML DOCUMENT

In our first week of coding XML you will build a simple address book with 5 or so records, using a total of 20 elements. It will follow the rules of XML syntax to be well-formed and, with our DTD next week, be valid as well. The file is already built, so I'm hoping that you will modify this one or embark on your own theme.

  1. Your first xml file The first thing that we need to do is create an XML file, save it, and then open it in IE5.5 (or IE 6). Use your favorite text editor (BBEdit or Notepad) to type your XML document. In our file we will type <name> </name> and then save the file. The file extension of an XML document needs to be ".xml". After saving the file, go ahead and open it with IE5 (or IE 6), as you would open an HTML file. MSIE shows that it understands XML by displaying the tags in color. Why does it display "<name />"? We will talk about that later.
  2. Adding more You just saw how simple it was to create and view an XML document without having really talked about any of the specifics yet. In this file, we have made a more elaborate XML document. We've actually started to build out the address_book.xml file. Later on, we will explain some of these concepts.
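As a rough sketch of where the address book is heading, a small file might look like the one below. The element names address_book, record, names, and name are taken from these notes; phone and email, and all of the data values, are assumptions made up for illustration and are not the contents of the class file.

```xml
<address_book>
  <record>
    <names>
      <name>Ada Example</name>
    </names>
    <phone>555-0100</phone>
    <email>ada@example.com</email>
  </record>
  <record>
    <names>
      <name>Grace Sample</name>
    </names>
    <phone>555-0101</phone>
    <email>grace@example.com</email>
  </record>
</address_book>
```

Saved with a ".xml" extension, a file like this should open in the browser as a collapsible tree of tags.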
COMMENTING
  1. Comments One of the first things a person should learn about a language is its commenting syntax. XML comments should be familiar to you, as the syntax is almost the same as HTML's. (Actually these are SGML comments, as you will read later in your book.)
  2. Except... You cannot use "--" inside a comment. If you like long comment lines as easy visual markers, such as
    "<!-- -----------SOME INFO--------- -->", you cannot use "--"; use "=="
    instead: "<!-- ============SOME INFO=========== -->"
  3. Some possible Alternatives Use any of these commenting lines.
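The commenting rules above can be sketched in a small file. A comment opens with "<!--" and closes with "-->", and nothing between those two markers may contain a double hyphen. The record content here is a made-up placeholder.

```xml
<address_book>
  <!-- A single-line comment -->

  <!-- ============ SOME INFO ============ -->

  <!--
    A comment can also span
    several lines.
  -->

  <!-- ********** another possible visual marker ********** -->
  <record>
    <name>Ada Example</name>
  </record>
</address_book>
```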
WELL-FORMED XML DOCUMENT
  1. Root Element For a document to be well-formed it must follow the XML recommendation. First of all, a document must have a root element. For example, "<HTML>" is the root element in an HTML document, and "<address_book>" is our root element. The next guideline is the order in which to open and close elements.
  2. One and only one... root element is allowed. A document cannot contain two or more root elements. In this example there are two instances of "<address_book></address_book>". You can take a look at the source code of this document if you need to see the code itself. Your browser should return an error.
  3. FOLE First Open Last End. This is a play on the programming term FILO (First In Last Out). Another guideline which an XML document needs to meet is the order of the end elements, FOLE. Basically, you are required to preserve the symmetry of element nesting. The first opening element is the last closing element. Or close last what you opened first, and close first what you opened last. Whatever. The best example is the root element; it is the first element to open and the last to end.
  4. Do not try this at home In this simple example, the order of "</names>" and "</name>" was switched. This would cause any XML parser to choke. Actually, you will get a polite error statement about a mismatch, including the line and position number of the first "offense". For those of you using Dreamweaver to code your XML (never too early), the yellow tag error marker will also highlight the mismatch.
  5. alpha under case-no "xml" An XML tag name must start with an alphabetic character (a-z or A-Z) or an underscore (_), and tag names are case-sensitive. A tag name cannot begin with the character combination "xml", in any mix of upper and lower case; it's a reserved name (we'll talk about reserved names later). By the way....
  6. Case matters! Unlike HTML, XML is fully case-sensitive. In this example the closing element for "<record>", "</record>" was replaced by "</Record>". To your computer, ASCII characters are all that matter. An upper case "A" is no more similar to a lower case "a" than it is to "z". Try to stay lower case as a good habit.
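To pull the well-formedness rules together, here is a small document that follows all of them, with the rules noted in comments. It is only a sketch: the _note element and the data values are invented for illustration. The closing comment lists changes that would each break the document.

```xml
<address_book>                <!-- the one and only root element -->
  <record>
    <names>
      <name>Ada Example</name>
    </names>                  <!-- FOLE: name closes before names -->
  </record>                   <!-- case matches the opening tag -->
  <_note>A tag name may start with a letter or an underscore.</_note>
</address_book>
<!--
  Each of the following changes would make this document ill-formed:
    1. adding a second <address_book></address_book> pair after the root
    2. swapping </name> and </names>
    3. closing <record> with </Record>
-->
```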
ANATOMY OF AN XML DOCUMENT
  1. Three Segmental Structure An XML document can be broken up into three sections: Prolog, Root Element, and Epilog. A well-formed and valid XML document (version 1) will have a "prolog" and a root element.
  2. Processing Instruction The <?xxxxx ?> form is a special syntax referred to as a processing instruction. The most typical example of this syntax is the XML declaration, which tells the parser which version of XML the document adheres to (<?xml version="1.0"?>). Processing instructions carry information that the document provides to the XML parser. What we have been following is the XML version 1.0 specification. Future versions of the specification might differ substantially from the 1.0 recommendation, so for upward compatibility it's a great idea to include the version of XML that your document adheres to. Later on we will see different uses for the processing instruction. Notice that the comment containing "PROLOG" is no longer there. That is because the XML declaration, when present, must be the very first thing the parser sees; nothing, not even a comment or white space, may come before it. The epilog can contain more comments and other processing instructions.
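A minimal sketch of the three-part structure follows. The XML declaration is real syntax; the processing instruction in the epilog uses a made-up target name (address_book_note) and is an assumption purely for illustration, as are the data values.

```xml
<?xml version="1.0"?>
<!-- PROLOG: the declaration above is the very first thing in the file -->
<address_book>
  <!-- ROOT ELEMENT: everything else nests inside it -->
  <record>
    <name>Ada Example</name>
  </record>
</address_book>
<!-- EPILOG: comments and processing instructions may follow the root -->
<?address_book_note status="draft"?>
```

Moving the first comment above the declaration should produce a parser error, which matches the point about the declaration having to come first.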
MORE ON XML
  1. Attribute Like HTML, the elements of an XML document can support attributes. Attributes extend an element's capacity of structuring a document by packing additional information about that element.
  2. Entity References An entity reference is a placeholder that the XML parser replaces with the entity's value as it parses the document. Entities are not new at all. The page you are viewing uses the entity reference "&lt;" to display "<". This is because HTML tags, and XML tags as well, are marked by being enclosed within "< >" characters, so a literal "<" typed into the text would be treated as the start of a tag rather than displayed on the screen. The XML language has 5 built-in entity references: (&lt; <), (&gt; >), (&amp; &), (&apos; '), and (&quot; "). These entity references are derived from SGML, hence their appearance in HTML.
  3. CDATA? PCDATA? So what type of stuff can be placed within an element? Here we come across this CDATA PCDATA stuff. For starters, any text placed within an element is by default of type Parsed Character DATA (PCDATA). This means the data will be parsed by the XML parser. In contrast to PCDATA is plain old Character DATA (CDATA), data that is not parsed by the parser. As you may remember, in our Entity example we had to use "&lt;" when encoding the HTML text so that the parser would replace it with "<". But CDATA does not get parsed, so there is no reason to use any entities. If you are very detail-oriented you should also notice that there is some white space within the HTML and BODY in choice "c". The white space in CDATA is preserved since the parser never parses this data, and therefore the white space is not converted to a single white space as it normally would be.
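The three ideas in this section (attributes, entity references, and CDATA) can be sketched in one record. The attribute names id and type, the note and sample elements, and the text values are assumptions for illustration only. A CDATA section is written between "<![CDATA[" and "]]>".

```xml
<address_book>
  <!-- Attributes: extra information carried on the opening tag -->
  <record id="1" type="business">
    <!-- Entity references: &amp; stands in for a literal ampersand -->
    <name>Smith &amp; Jones</name>
    <note>Use &lt;name&gt; for each person&apos;s &quot;display&quot; name.</note>
    <!-- CDATA: nothing inside is parsed, so no entities are needed -->
    <sample><![CDATA[
      <HTML>
        <BODY>   Literal < > & characters are fine here.   </BODY>
      </HTML>
    ]]></sample>
  </record>
</address_book>
```

The only sequence a CDATA section cannot contain is its own closing marker, "]]>".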
PUTTING IT TOGETHER
  1. Empty Elements If you have had any HTML experience (of course you have) you have seen cases in which a closing tag is not really needed. One of the best cases is the image tag, <IMG src="some url">. The break tag <BR> is another example, and the closing tag of a paragraph <P> is often left off. Since XML requires every element to have an ending element, a special syntax was devised to handle "empty tags". The syntax of an empty tag is <IMG src="some url" /> or <br />. When I send email, I often use <snip> to divide sections for meaning. Technically I should use <snip />. Sorry, oops, <sorry />.
  2. To Nest or Not to Nest One of the main questions that everyone faces when designing an XML document is whether to use nesting or an attribute. If you need a hierarchical data structure you should nest (non-empty elements); otherwise attributes are a bit more efficient to parse. Some types of information lend themselves better to hierarchical than to sequential organization. However, the W3C has said that performance should not be a criterion in the design of XML. Nesting adds organization, but be wary of overnesting (too many levels). In general, most 'ontologies' will have three levels, so for an XML data structure, a data model with three levels is (usually) just fine.
  3. Mixed elements - Think of empty and nested as hot and cold water. Most of us use 'warm' water, which we produce by mixing some hot and some cold. Plumbing only comes in hot and cold, so that's where we start. The *majority* of human designed XML code is in a nested format, and the *majority* of machine generated code is empty. The code I use (in biological applications) is mostly empty with some nested, and a fair amount of 'mixed' character. Take a close look at the six example files below; they will help you understand each style.
  4. Ordinal counting - Now this is where it gets a little tricky. Before you read any further, make sure that you have looked at the six files (above). Some data models have an element of 'counting' to them. For instance, in a recipe, you might have ingredients and steps. What you want to avoid is having numbers or any notion of counting in an element name. Elements like <ingredient_1>, <ingredient_2>, <ingredient_3>, and <step_1>, <step_2>, and <step_3> are such an example. If you find yourself 'counting' in your element names, in either the nested or the empty model, you probably should be using attributes, or 'mixed' character in your nested and empty models. Take a look at recipe_counting.xml. That's the *wrong* way to do it. Now take a look at recipe_attributes.xml - that's the *right* way to do it. If you have a model where there is a notion of order, try looking at these files: recipe_attributes_nested.xml and recipe_attributes_empty.xml. Lastly, if you *really* want to explore a complicated model, take a look at the knitting_patterns folder.
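The class recipe files are not reproduced here, so the following is an independent sketch of the same idea rather than their actual contents. The wrapper elements (recipes, recipe_counting, recipe_nested, recipe_empty) and the attribute names (number, name, action) are assumptions chosen for illustration. It contrasts counting in element names with the nested and empty styles that move the number into an attribute.

```xml
<recipes>
  <!-- Counting in element names: the style to avoid -->
  <recipe_counting>
    <ingredient_1>flour</ingredient_1>
    <ingredient_2>water</ingredient_2>
    <step_1>Mix the ingredients</step_1>
    <step_2>Bake</step_2>
  </recipe_counting>

  <!-- Nested style: one element name, the order lives in an attribute -->
  <recipe_nested>
    <ingredient number="1">flour</ingredient>
    <ingredient number="2">water</ingredient>
    <step number="1">Mix the ingredients</step>
    <step number="2">Bake</step>
  </recipe_nested>

  <!-- Empty style: all of the data is carried in attributes -->
  <recipe_empty>
    <ingredient number="1" name="flour" />
    <ingredient number="2" name="water" />
    <step number="1" action="Mix the ingredients" />
    <step number="2" action="Bake" />
  </recipe_empty>
</recipes>
```

With the attribute versions, adding a third ingredient means adding one more ingredient element, not inventing a new element name.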

EXAMPLES

Students may also want to look at address_book_mixed.xml and download the entire Address_Book.zip archive.

HOMEWORK


Copyright © 2006 - 2007 Robert D. Cormia - September 17, 2006