The advantages of XML are already well articulated: it provides a way to organize data in a self-describing manner (where the meaning of the data is clearly indicated by the element to which it belongs); it allows its data to be manipulated by standard libraries that are readily available on most platforms; it is readable by humans; and it can be validated for well-formedness as well as for semantic validity using standard tools. All of these contribute to making it one of the most popular data interchange formats in use today.

While the basic ideas behind XML are fairly easy to grasp, advanced XML usage can quickly get rather confusing. For instance, simply browsing through a WSDL document places you slam bang in the middle of the territory of schemas and namespaces.

So, in this post, I am going to review the key concepts that underlie the advanced use of XML. If you’re already well familiar with this area, I’ll meet you on the other side. Else, buckle in, and enjoy the ride.

Document Types

The documents you’ll see in the XML world fall into 2 broad categories – schemas and instance documents. The former category is used to define the legal structure of your XML documents, while the latter contains the documents that your applications will largely generate and/or consume.

In other words, if a schema is considered to be equivalent to a Java class definition, then the instance document is equivalent to an instance of that class.

XML Application

An XML application is a set of rules that define the structure of an instance document. For instance, SVG, MathML, and XHTML define a set of elements, as well as their attributes, relationships, and type rules.

In other words, an XML application is defined by one or more schemas that define what constitutes a valid XML document.

Namespaces

A key challenge with building an XML application is that since most applications are designed and developed independently of each other, it is not uncommon for a tag name used by one application (e.g., <head>, <title>, or <msub>) to also be inadvertently used by another.

This can get hairy when a single XML instance document combines more than one XML application, resulting in a condition where a given element, such as <title>, may have a different meaning within the context of each application. In such a case, when the element <title> shows up in an instance document, its actual meaning may be quite unclear.

Namespaces and Qualified Names

So how does one prevent naming collisions?

The traditional answer is “using namespaces” and that’s the answer here too.

A namespace works by providing a unique space within which an application’s names can be defined. As the developer of the application it is your responsibility to ensure that you have acquired a globally unique namespace identifier. Then, you must ensure that the names of your components (elements, types, attributes, etc.) are unique within the application itself.

In other words, a component’s name is globally unique only when its namespace identifier and its local name within the application are taken together. This composite name is termed its “qualified name”.

This can be expressed as:

Qualified_Name = Unique_Name_Of_the_Applications_Namespace : Local_Name_of_the_Component.

But, how does one guarantee uniqueness of namespace identifiers?

The easiest option is to simply use the absolute URI of a domain that is owned by your organization.

For instance, I own the domain swengsol.com, and so I could choose to use the namespace http://mynamespace.swengsol.com/contacts for my sample XML application that deals with contacts.

It is important to note that a URL (or even more generally a URI) is simply a convenient way to establish a globally unique namespace. The convenience comes from the fact that you have previously registered your domain with a central registration authority.

While a namespace might resemble a URL, it is not expected that this URL will resolve to a particular document, or even that the virtual host mynamespace.swengsol.com actually exists anywhere on the Net. Instead, this namespace is simply a logical identifier used to ensure that any component names I define within it are globally unique. To make sure you understand this concept, ask yourself why you should not use namespaces based on time.com or cnn.com for your own application.

So, if I were to define an element <organization> in my contacts XML application, the qualified name for that element would be: http://mynamespace.swengsol.com/contacts:organization, where http://mynamespace.swengsol.com/contacts is my unique namespace, and where organization is a local name that I guarantee is unique within my own namespace. Taken together, the qualified name is therefore guaranteed to be unique across the entire universe of XML component names.
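To make the idea of a qualified name concrete, here is a minimal Python sketch using the standard library's ElementTree module. Note that ElementTree writes a qualified name as {namespace}local-name rather than with the colon I used above; the second namespace (http://example.com/hr) is purely hypothetical and exists only to show that identical local names do not collide.

```python
import xml.etree.ElementTree as ET

CONTACTS_NS = "http://mynamespace.swengsol.com/contacts"
OTHER_NS = "http://example.com/hr"  # hypothetical namespace, for contrast only


def qualified_name(namespace: str, local_name: str) -> str:
    """Combine a namespace identifier and a local name into one name."""
    # ET.QName renders the pair as "{namespace}local_name".
    return ET.QName(namespace, local_name).text


mine = qualified_name(CONTACTS_NS, "organization")
theirs = qualified_name(OTHER_NS, "organization")

print(mine)
print(mine == theirs)  # same local name, different namespaces
```

The two names share the local part organization, yet compare as different because the namespace is part of the name.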

Namespace Prefixes

Unfortunately, using a URI directly as part of a name is problematic because characters such as “/”, which are legal within a URI, are illegal within an XML element or attribute name. The solution is to confine the URI to a safe location within the document (an attribute value), to map the URI to a short, legal logical name for that namespace (a prefix, e.g., “xsd”), and then to use that prefix wherever the actual namespace would have been used.

So how do we associate the actual namespace name with its logical equivalent?

We use a namespace declaration.

A namespace declaration uses an attribute of the form xmlns:prefix on any XML element to bind a given namespace to its prefix. That binding is visible for that element and any of its descendants. If declared on the root element (the usual case), it is available throughout the document.

<HTML:html xmlns:HTML="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
 xmlns:C="http://mynamespace.swengsol.com/contacts">
 <HTML:head>
   <HTML:title>Contacts sorted by organization</HTML:title>
 </HTML:head>
 <HTML:body>
   <C:organization name="Software Engineering Solutions, Inc">
     <C:contact>
       <C:title>Director of Marketing</C:title>
       ...
     </C:contact>
     ...
   </C:organization>
 </HTML:body>
</HTML:html>

As you can see, elements from different namespaces (HTML:title and C:title) can coexist quite well within a single instance document – even if they share the same local name (title).
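Here is a small Python sketch (standard library only) that parses a trimmed copy of the document above and pulls out each title separately. The prefixes h and c in the query map are my own choice for this example; they need not match the prefixes used inside the document, because only the namespace URIs are compared.

```python
import xml.etree.ElementTree as ET

DOC = """\
<HTML:html xmlns:HTML="http://www.w3.org/1999/xhtml"
           xmlns:C="http://mynamespace.swengsol.com/contacts">
  <HTML:head>
    <HTML:title>Contacts sorted by organization</HTML:title>
  </HTML:head>
  <HTML:body>
    <C:organization name="Software Engineering Solutions, Inc">
      <C:contact>
        <C:title>Director of Marketing</C:title>
      </C:contact>
    </C:organization>
  </HTML:body>
</HTML:html>
"""

# Query-side prefix map: prefix -> namespace URI.
NS = {
    "h": "http://www.w3.org/1999/xhtml",
    "c": "http://mynamespace.swengsol.com/contacts",
}

root = ET.fromstring(DOC)

# Both elements have the local name "title", but live in different namespaces.
page_title = root.find(".//h:title", NS).text
job_title = root.find(".//c:title", NS).text

print(page_title)
print(job_title)
```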

Default Namespace

The default namespace is a special case used to reduce the verbosity of an instance document. If a large percentage of your elements come from a single namespace, you can declare the mapping of that namespace to an empty logical namespace name. Any elements that do not belong to a named namespace now belong to this default namespace.

<html  xmlns="http://www.w3.org/1999/xhtml"
       xmlns:C="http://mynamespace.swengsol.com/contacts">
  <head><title>Contacts sorted by organization</title></head>
  <body>
    <C:organization>
    ...

You could also declare the default namespace (or any other namespace) at a lower level within the document tree, in which case the earlier declaration is overridden for that element and any of its descendants.

The default namespace does not apply to attributes. An attribute that is not prefixed does not exist in any namespace.
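The following Python sketch illustrates these rules: a default namespace on the root, a second default namespace that overrides it lower in the tree, and unprefixed attributes that stay outside any namespace. The sample document and the organization name "Acme" are made up for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CONTACTS_NS = "http://mynamespace.swengsol.com/contacts"

DOC = """\
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <body>
    <organization xmlns="http://mynamespace.swengsol.com/contacts" name="Acme">
      <contact/>
    </organization>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# Element names in document order, written as "{namespace}local-name".
tags = [element.tag for element in root.iter()]

# Attribute names keep no namespace when they carry no prefix.
organization = root.find(".//{%s}organization" % CONTACTS_NS)
attribute_names = list(organization.attrib)

for tag in tags:
    print(tag)
print(attribute_names)
```

The html and body elements pick up the XHTML namespace, organization and contact pick up the contacts namespace from the inner declaration, and the name attribute is reported without any namespace at all.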

Target Namespace

This concept has meaning only when we discuss XML Schemas – so while I introduce it here, we won’t actually see it in more detail until a bit later.

A target namespace is declared within an XML Schema to indicate the namespace to which the components (elements, types, and attributes) defined in that schema belong.

XML Schemas

A schema provides you with a way of defining what is a valid and legal XML instance document for your XML application. It lets you define the structure of your document (which elements and attributes are permissible, and in what combinations), as well as the legal data types for your elements and attributes.

A schema’s root element is called schema, and is in the http://www.w3.org/2001/XMLSchema namespace (conventionally bound to the xsd prefix).

Content Model

The content model describes the content of an XML element. An element has a “simple” content model when it only contains a text node; a “complex” model when it can only take subelements; “mixed” when both can be present; and “empty” when no content is allowed.

A simple content model:

<name>Damodar Chetty</name>

A complex content model:

<name>
 <first>Damodar</first>
 <last>Chetty</last>
</name>

An element that can only take a simple content model and has no attributes is considered a simple type, while all others are considered complex types.

XML Simple Type

The XML Schema specification defines 44 built-in simple data types in 4 main categories. These include numeric data types such as integers (xsd:int, xsd:short, and xsd:long) and real numbers (xsd:float, xsd:double, and xsd:decimal); date and time data types such as a specific date (xsd:date), time (xsd:time), or length of time (xsd:duration); XML types such as an XML ID (xsd:ID) or an ID reference (xsd:IDREF); strings (xsd:string); booleans (xsd:boolean); a URI (xsd:anyURI); and so on.

For a detailed reference, visit: http://www.w3.org/TR/xmlschema-2/.

In addition, you can extend these built-in simple types to derive new custom simple types. For instance, you can use a regular expression to restrict the values that are legal for a given type. You do this using the simpleType element and its restriction child which takes one or more facets that let you restrict the allowable values.

Allowable facets for the string data type include xsd:pattern which can be used to define a regex pattern that defines legal values; as well as xsd:minLength and xsd:maxLength which define the minimum and maximum length of the content.

<xsd:simpleType name="us-zipcode">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\p{Nd}{5}"/>
  </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="quantity">
  <xsd:restriction base="xsd:int">
    <xsd:minInclusive value="0"/>
    <xsd:maxExclusive value="10000" />
  </xsd:restriction>
</xsd:simpleType>

Once these new types have been defined, you can use them to declare other elements.

<xsd:element name="quantity" type="inv:quantity"/>
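To get a feel for what these facets accept and reject, here is a rough Python emulation. It is only a sketch of the facet semantics and not a schema validator (Python's standard library does not ship one); the function names are my own, and the \p{Nd} class is approximated with the Unicode "Nd" category because Python's re module does not support \p{...} escapes.

```python
import re
import unicodedata


def is_us_zipcode(value: str) -> bool:
    """Approximate the pattern facet \\p{Nd}{5}: exactly five decimal digits."""
    return len(value) == 5 and all(
        unicodedata.category(ch) == "Nd" for ch in value
    )


def is_quantity(value: str) -> bool:
    """Approximate minInclusive=0 and maxExclusive=10000 on an integer base."""
    text = value.strip()
    # Accept only an optional sign followed by ASCII digits.
    if not re.fullmatch(r"[+-]?[0-9]+", text):
        return False
    return 0 <= int(text) < 10000


print(is_us_zipcode("55401"), is_us_zipcode("5540a"))
print(is_quantity("9999"), is_quantity("10000"))
```

Note how maxExclusive makes 10000 itself illegal, while minInclusive keeps 0 legal.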

XML Complex Data Types

In addition to simple types, you can also define your own complex data types that are composed of one or more simple types, or even from other complex types.

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
 xmlns="http://www.swengsol.com/contacts"
 targetNamespace="http://www.swengsol.com/contacts" >

 <xsd:element name="organization" type="organizationType" />

 <xsd:simpleType name="us-zipcode">
   <xsd:restriction base="xsd:string">
     <xsd:pattern value="\p{Nd}{5}"/>
   </xsd:restriction>
 </xsd:simpleType>
 <xsd:complexType name="organizationType">
   <xsd:sequence>
     <xsd:element name="contactName">
       <xsd:complexType>
         <xsd:sequence>
           <xsd:element name="firstName"/>
           <xsd:element name="lastName"/>
         </xsd:sequence>
       </xsd:complexType>
     </xsd:element>
     <xsd:element name="phone"   type="xsd:string" />
     <xsd:element name="address" type="address" />
   </xsd:sequence>
   <xsd:attribute name="name" type="xsd:string"/>
 </xsd:complexType>
 <xsd:complexType name="address">
   <xsd:sequence>
     <xsd:element name="street"  type="xsd:string" />
     <xsd:element name="city"    type="xsd:string" />
     <xsd:element name="state"   type="xsd:string" />
     <xsd:element name="zip"     type="us-zipcode" />
   </xsd:sequence>
 </xsd:complexType>
</xsd:schema>

There are a few things to note with this schema:

1. The targetNamespace identifies the namespace into which the components defined in this schema will be placed.

2. Unprefixed references to component names (for example, type="organizationType") resolve to the default namespace, which this schema declares to be the same as its target namespace.

3. Components that are named (using the name attribute) and that are defined directly under the xsd:schema document element within a schema, are called “global” components. In this schema, the organization, us-zipcode, organizationType, and address components are global components. The visibility of global components is schema-wide. Global elements (organization) can be root elements for instance documents that conform to this schema.

4. The contactName element’s type definition is an anonymous component since the definition is local to its containing element. The use of anonymous types limits reuse, since a type definition that is not named cannot be used by any element other than contactName. The contactName element is not a direct child of the schema element, hence it cannot be used as a document’s root element.

The benefit here is that you could define different content models for the same element. For instance <contactName> could be locally defined within one containing element to have first, last and middle sub elements; while when used within another element it could simply have a text node.

Global components have special properties:

  • they can be referenced from anywhere within the schema, as well as from another schema that may include or import this schema. For instance, global types can be referenced in any element definition using the type attribute. New complex types can also be derived from these global definitions.
  • their names must be unique within a schema, hence you are limited to a single global element with a given name. If you need to reuse a name, you must define the component locally under the appropriate parent.
  • a global element definition can not only be referenced anywhere within the schema, but can also be used as the root element of instance documents based on this schema.
  • complex types used as building blocks must appear as top level complexType elements in the schema.
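One way to see which components of a schema are global is to list the named, direct children of its xsd:schema element. The Python sketch below does this for a trimmed-down schema in the style of the contacts schema; the helper name global_components is my own, and the code only inspects the schema as an ordinary XML document rather than processing it as a schema.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://www.swengsol.com/contacts"
            targetNamespace="http://www.swengsol.com/contacts">
  <xsd:element name="organization" type="organizationType"/>
  <xsd:complexType name="organizationType">
    <xsd:sequence>
      <xsd:element name="contactName">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="firstName"/>
            <xsd:element name="lastName"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def global_components(schema_text):
    """Return (kind, qualified name) for each named top-level component."""
    root = ET.fromstring(schema_text)
    target = root.get("targetNamespace")
    xsd_prefix = "{" + XSD_NS + "}"
    found = []
    for child in root:  # direct children of xsd:schema only
        name = child.get("name")
        if child.tag.startswith(xsd_prefix) and name is not None:
            kind = child.tag[len(xsd_prefix):]
            found.append((kind, "{%s}%s" % (target, name)))
    return found


for kind, name in global_components(SCHEMA):
    print(kind, name)
```

The nested contactName, firstName, and lastName declarations are not reported, since they are local to their containing definitions.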

Complex Data Type Composition

In general, complex types are composed of simple types arranged in some compositional manner: using an ordered sequence of elements (xsd:sequence), an unordered set of elements (xsd:all), or a selection among alternatives (xsd:choice).

The sequence is the simplest construct, where you specify the order in which elements must appear, and for each element you specify its type as well as the number of times it is allowed to appear (using occurrence bounds). You can also add occurrence attributes to the sequence as a whole.

xsd:all defines an unordered grouping of one or more individual element declarations, where each element may only occur either 0 or 1 times.

xsd:choice allows an unordered grouping of one or more individual element declarations, where only one element from that group may appear in an instance document. The xsd:choice element itself can be bounded.
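The difference between the three compositors can be sketched in a few lines of Python. This is a deliberately simplified model of my own: it assumes every declared element must occur exactly once (the default occurrence bounds) and it only looks at child element names, so it is an illustration of the idea rather than a validator.

```python
def matches_sequence(children, declared):
    """xsd:sequence: the declared elements, in the declared order."""
    return list(children) == list(declared)


def matches_all(children, declared):
    """xsd:all: the declared elements, in any order."""
    return sorted(children) == sorted(declared)


def matches_choice(children, declared):
    """xsd:choice: exactly one of the declared elements."""
    return len(children) == 1 and children[0] in declared


declared = ["street", "city", "state"]

print(matches_sequence(["street", "city", "state"], declared))
print(matches_sequence(["city", "street", "state"], declared))
print(matches_all(["city", "street", "state"], declared))
print(matches_choice(["city"], declared))
```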

Attaching Schemas to Instance Documents

Most parsers can validate an XML instance document against a given schema to ensure that the instance document conforms to that given markup language. There are two ways in which to link an instance document to its defining schema. In both cases, the instance document has an embedded pointer that references its schema.

In the first mechanism, the instance document uses the xsi:noNamespaceSchemaLocation attribute on its root element to reference its schema document.

<?xml version="1.0"?>
 <organization xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="contacts.xsd">

In this case, the path to the XSD can be an absolute URL to somewhere on the Internet, or it can be a path relative to the instance document on the local hard drive. Note that you do not need to specify the location of the schema that defines the xsi namespace since the XMLSchema-instance namespace is supported natively by any XML schema validating parser.

In the second mechanism, a separate xsi:schemaLocation attribute is used in the document root element to map a namespace to its associated schema document. Whitespace is used to separate each namespace from its schema, and to separate namespace/schema pairs from each other. Whether you use spaces or line breaks for this is purely a matter of readability.

<organization
 xmlns="http://www.swengsol.com/contacts"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.swengsol.com/contacts http://www.swengsol.com/contacts.xsd
 http://www.w3.org/2001/XMLSchema-instance http://www.w3.org/2001/XMLSchema.xsd">

For a referenced schema such as contacts.xsd to be useful, that schema’s targetNamespace must match a namespace that is used within this instance document.
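The pairing of namespaces with schema locations is easy to see if you split the attribute value yourself. The Python sketch below does just that on a small instance document; the helper name schema_locations is my own, and the code only reads the hint without fetching or validating anything.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

DOC = """\
<organization
  xmlns="http://www.swengsol.com/contacts"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.swengsol.com/contacts
                      http://www.swengsol.com/contacts.xsd"/>
"""


def schema_locations(root):
    """Map each namespace in xsi:schemaLocation to its schema document."""
    raw = root.get("{%s}schemaLocation" % XSI_NS, "")
    tokens = raw.split()  # any run of whitespace separates the tokens
    # Tokens alternate: namespace, location, namespace, location, ...
    return dict(zip(tokens[0::2], tokens[1::2]))


root = ET.fromstring(DOC)
locations = schema_locations(root)
print(locations)
```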

Combining Schemas

Schemas support reuse of defined types and structures using standard mechanisms of imports and includes. An import lets you combine schemas from different namespaces, while an include lets you combine schemas from the same namespace.

In both cases, you need at least two schema definitions – where one of the schemas (the dependent) is being imported into or included by the other (the independent).

An include is the simpler operation, and is used simply as a composition mechanism to construct an overall schema out of individual portions. In this case, the target namespace advertised by the independent and dependent schemas must match exactly (or the included schema must declare no target namespace at all, in which case it adopts that of the including schema). This is appropriate since even though they are physically distinct, the schemas being composed are logically part of a single namespace.

<include schemaLocation="http://www.swengsol.com/contacts.xsd" />

The include mechanism is very straightforward, and can be considered a direct copy and paste into the independent schema.

While including is a way of composing parts of a single grammar into a single complete set, importing is a way of composing independent grammars into a single family of rules. In other words, each grammar in an import can stand by itself quite comfortably, and is only being imported in order to be used in a synergistic manner with another cooperating grammar. As a result, the namespaces of the grammars will be different, and there is no requirement that relates the targetNamespace of the two schemas. The import element must therefore specify not only the location of the schema being imported, but also its namespace, which should match the targetNamespace within the dependent schema.

<import namespace="http://www.swengsol.com/contacts"
        schemaLocation="http://www.swengsol.com/contacts.xsd" />

The imported schema’s namespace must be bound to a prefix, usually using an xmlns:prefix declaration on the independent schema’s root element, before its components can be referenced.
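As a rough illustration of how the two mechanisms differ, the Python sketch below reads a schema and lists what it pulls in. The orders namespace and the order-types.xsd file are hypothetical names invented for this example, as is the helper name dependencies; the code inspects the schema as plain XML and does not load the referenced documents.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# A hypothetical schema that includes one file and imports the contacts schema.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:c="http://www.swengsol.com/contacts"
            targetNamespace="http://www.swengsol.com/orders">
  <xsd:include schemaLocation="order-types.xsd"/>
  <xsd:import namespace="http://www.swengsol.com/contacts"
              schemaLocation="http://www.swengsol.com/contacts.xsd"/>
</xsd:schema>
"""


def dependencies(schema_text):
    """Return (kind, namespace, location) for each include and import."""
    root = ET.fromstring(schema_text)
    target = root.get("targetNamespace")
    found = []
    for kind in ("include", "import"):
        for element in root.findall("{%s}%s" % (XSD_NS, kind)):
            # An include shares the including schema's target namespace;
            # an import names the foreign namespace explicitly.
            namespace = target if kind == "include" else element.get("namespace")
            found.append((kind, namespace, element.get("schemaLocation")))
    return found


for entry in dependencies(SCHEMA):
    print(entry)
```

Notice that the imported namespace is also bound to the prefix c on the root element, which is what lets the rest of the schema refer to its components.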

That’s it! This much XML knowledge is generally sufficient for most usage scenarios. I’ll follow this post with another that takes a closer look at the WSDL definition of web service contracts.

Updated July 16: I spoke too soon … I was informed that there’s one more concept that I should have covered in the above article.

Qualified and Unqualified Elements and Attributes

Global elements used in your instance document must always be namespace-qualified (either explicitly by using a namespace prefix, or implicitly by specifying a default namespace for that instance document). Whether local elements must also be qualified is controlled by the elementFormDefault attribute of the schema root element, which is set to “unqualified” by default; in that case, local elements appear in the instance document without a namespace.

You can explicitly set this attribute to “qualified” to indicate that even local elements must now be qualified (explicitly with a prefix, or implicitly through a default namespace).

In a similar fashion, the schema root element’s attributeFormDefault attribute can also be set. If set to qualified, then both global as well as local attributes must be explicitly qualified. Note that the default namespace does not apply to attributes, and so there is no implicit qualification that occurs as with elements.

Note that you can override the defaults using the form attribute of the element and attribute schema elements.
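To see what the two settings look like from the instance document's side, here is a small Python sketch that compares the names a parser reports for a local element (phone) in each style. The two sample documents and the phone number are made up for this illustration, and no schema validation takes place.

```python
import xml.etree.ElementTree as ET

# Shape expected when elementFormDefault="unqualified": the global root
# element is qualified, the local child element is not.
UNQUALIFIED = """\
<c:organization xmlns:c="http://www.swengsol.com/contacts">
  <phone>555-0100</phone>
</c:organization>
"""

# Shape expected when elementFormDefault="qualified": local elements
# carry the namespace too.
QUALIFIED = """\
<c:organization xmlns:c="http://www.swengsol.com/contacts">
  <c:phone>555-0100</c:phone>
</c:organization>
"""


def child_tags(document_text):
    """Return the names of the root element's direct children."""
    return [child.tag for child in ET.fromstring(document_text)]


print(child_tags(UNQUALIFIED))
print(child_tags(QUALIFIED))
```

I used an explicit prefix on the root in both documents on purpose: had I used a default namespace instead, an unprefixed phone element would silently have become qualified.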