XML Tutorial: Getting Started with XML


By Tim Buchalka for Udemy

Check out some of Tim’s other courses on Java, Android development, and Python. Tim also operates the Learn Programming Academy website.

What is XML?
Presenting XML

Transforming/Styling XML
XSLT Processing

Parsing XML

SAX and DOM Parsers

Parsing XML in Java

Parsing in Java using DOM
Parsing in Java using SAX

Parsing XML in Python

Parsing in Python using DOM
Parsing in Python using SAX


What is XML?

XML stands for eXtensible Markup Language and was derived from SGML (Standard Generalized Markup Language), itself an extension of the Generalized Markup Language (GML) created at IBM. The first standard for SGML was published in 1986, but SGML is very complex and not at all easy to use.

XML was created in the 1990s to be a simpler, but still very powerful, version of SGML suitable for use on the World Wide Web.

Tim Berners-Lee created HTML as a way to present documents in a standard form, but HTML was primarily concerned with presentation rather than content. Although it is possible to extract meaning from an HTML document, it’s not easy and does rely on knowledge of the content of the document. XML is focused on content or meaning rather than presentation, and an XML document describes its content so that no prior knowledge of the document is required.

Although it’s difficult to see it these days – with all the fancy stuff web developers do with scripting, with the advent of CSS to separate presentation from content and with HTML5 allowing even more impressive things to be done – HTML itself was pretty restrictive. It provided structure to documents: a title, various levels of headings and the ability to include tables, as well as the ability to format text and include images, but the fundamental unit of information exchange was the entire document. As use of the internet took off and more and more systems were required to communicate with each other, the restrictions of HTML led to the development of a markup language that was easy to use and flexible. Thus, XML was born.

The specification of XML is more precise than HTML, and introduces the concept of a “well-formed” document. HTML was fairly relaxed; for example, tags do not need to be closed, and tags can overlap. Consider this example:

<strong><em>This text is bold and italic.</strong></em>

This would be invalid in XML: tags are not allowed to overlap, so the </em> tag to end emphasis would have to appear before </strong>.

All tags in XML must also be closed. In HTML, the <p> tag to denote a paragraph can be parsed successfully even if it isn’t closed (which it often isn’t), and the <br> tag has no closing version at all (HTML5 merely tolerates the self-closing form). In XML, a line break would have to be specified as either <br></br> or as a self-closing tag <br />.
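A quick way to see these rules enforced is to hand the snippets to an XML parser. The following is a minimal sketch using the DOM parser that ships with the JDK; the class name WellFormedCheck and the helper isWellFormed are illustrative names, not part of this article’s later examples.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false if the parser rejects it.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // overlapping or unclosed tags end up here
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Overlapping tags, as in the HTML example above.
        System.out.println(isWellFormed("<p><strong><em>text</strong></em></p>"));
        // The same content, properly nested.
        System.out.println(isWellFormed("<p><strong><em>text</em></strong></p>"));
        // An unclosed <br>, then the self-closing form.
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));
        System.out.println(isWellFormed("<p>line one<br />line two</p>"));
    }
}
```

Only the second and fourth snippets should be accepted; the parser rejects the other two rather than guessing what was meant, which is the main practical difference from a forgiving HTML parser.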

The various HTML specifications up to HTML5 define exactly which tags will be recognized. In contrast, XML does not define any tags. Programmers are free to create any tags they want to use.

Presenting XML

As an example, consider the XML below that could be used to represent events in a calendar. Each event has a name, a date and time, and also a duration. Although you wouldn’t present the data in this raw format, it is easily readable.


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="SampleXML.xsl"?>
<Calendar>
   <Event>
       <EventName>Dentist</EventName>
       <StartDate>2015-10-12</StartDate>
       <StartTime>09:30:00</StartTime>
       <Duration units="hours">1.5</Duration>
   </Event>
   <Event>
       <EventName>Carnival</EventName>
       <StartDate>2015-10-17</StartDate>
       <StartTime>09:00:00</StartTime>
       <Duration units="days">2</Duration>
   </Event>
</Calendar>

Every XML document should start with an XML Declaration specifying the version, and optionally an encoding and whether the document is standalone or not. Actually, that’s not strictly accurate; the declaration was not required in the XML 1.0 specification, but XML 1.1 does require a declaration. If a declaration is not present, then the document is assumed to be version 1.0.

There must be only one root element in a well-formed XML document; <Calendar> is our root element here.

Elements may contain other elements or values, or both. Our event elements contain the name, date and time, and duration. EventName contains only text naming the event, and the StartDate and StartTime contain, not surprisingly, the date and time, respectively.

The Duration elements are interesting because they contain additional information in what is called an attribute. A number for a duration is only useful if you know what that number represents, so a “units” attribute is used to specify hours or days. Note that in XML, attribute values MUST be enclosed in quotes even if the attribute value is numeric; in HTML that was not a requirement and quotes were often not used around numbers.

Transforming/Styling XML

One of the cool things we can do with XML is transform it using an XSL stylesheet (also called a transform). XSL stylesheets are themselves written in XML. Technically, we are using XSLT (eXtensible Stylesheet Language Transformations), which is one of the XSL family of languages; we’ll be looking at another, XPath, in a later article.

Save the following file in the same directory as the earlier XML example, and we’ll look at how to apply it to our XML.


<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:template match="/Calendar">
       <html>
           <body>
               <xsl:for-each select="Event">
                   <xsl:choose>
                       <xsl:when test="Duration/@units = 'days'">
                           <h2 style="background-color:red"><xsl:value-of select="EventName"/></h2>
                       </xsl:when>
                       <xsl:otherwise>
                           <h2><xsl:value-of select="EventName"/></h2>
                       </xsl:otherwise>
                   </xsl:choose>

                   <h3><xsl:value-of select="StartDate"/></h3>
                   <h3><xsl:value-of select="StartTime"/></h3>

                   <xsl:value-of select="Duration"/>

                   <xsl:value-of select="concat(' ',  Duration/@units)"/>

               </xsl:for-each>
           </body>
       </html>
   </xsl:template>
</xsl:stylesheet>

It is possible to apply a transform to XML in code, without editing the original XML (useful if the XML is obtained from a web server), but for this example we’ll simply link the stylesheet from the XML file itself. Make sure the following line appears directly below the XML declaration (the listing above already includes it):

<?xml-stylesheet type="text/xsl" href="SampleXML.xsl"?>

…making sure you use the file name that you used when saving the stylesheet in the href.
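For the “in code” route mentioned above, the JDK’s javax.xml.transform package can apply a stylesheet without touching the XML. This is a minimal sketch rather than a complete program: the class name ApplyStylesheet and the cut-down XML and stylesheet strings are illustrative assumptions, and the JDK’s built-in processor implements XSLT 1.0.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ApplyStylesheet {

    // Runs the stylesheet over the XML and returns the generated output as a string.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A cut-down calendar and stylesheet, held in strings to keep the sketch self-contained.
        String xml = "<Calendar><Event><EventName>Dentist</EventName></Event></Calendar>";
        String xsl =
                "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
              + "<xsl:template match='/Calendar'>"
              + "<html><body><xsl:for-each select='Event'>"
              + "<h2><xsl:value-of select='EventName'/></h2>"
              + "</xsl:for-each></body></html>"
              + "</xsl:template>"
              + "</xsl:stylesheet>";
        System.out.println(transform(xml, xsl));
    }
}
```

To work with the files from this article instead, StreamSource also accepts a java.io.File, so new StreamSource(new File("SampleXML.xsl")) could replace the string-based sources.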

You should now be able to open the XML file in your browser and, provided you are using a modern browser, it should display quite nicely. Note that Google Chrome will not display styled XML from a file, though. That’s not because it isn’t modern but because of the way they have chosen to guard against security concerns with downloaded web pages.

[Image: the Calendar XML displayed in a browser after the stylesheet has been applied]

Actually, it’s a bit hideous; however, there’s no reason we couldn’t have included a <head> section with a link to a normal CSS file to style the generated HTML and produce something a lot more stylish – but this is an article on XML, not HTML, and hopefully that demonstrates that we can completely transform our XML into something suitable for presentation.

The XSL transform is able to take the contents of the XML document and enclose them in HTML elements to present the dataset in a much better format than the raw XML. Although this is a simple example, we can take any structured data and represent it as XML without losing the ability to present it in a format suitable for the internet.

The XSL transform starts with an XML declaration (as it is XML), then either an xsl:stylesheet or xsl:transform element. These are completely synonymous; you can use whichever you prefer and their content is identical apart from the word stylesheet or transform. In this element we specify the namespaces that we will be using; we are only using the W3C’s XSL namespace, so that’s the only one we’ve included.

Within a transform, there can be one or more template rule declarations to transform the XML; these start with xsl:template and include a match attribute that determines which part of the document the template will act upon. We have specified /Calendar so we can access anything within the Calendar element. It is common to see match="/", which matches the document node itself (the parent of the root Calendar element, not anything defined by the xsl:stylesheet or xsl:transform element). Had we done that, we would have had to select Calendar/Event in the for-each element later.

We are now free to mix HTML tags, text, and XSLT elements to format the data set as we wish.

After including the normal HTML and body tags, we use a for-each element to operate on every Event element contained within our match (Calendar).

The syntax of the conditional block is a bit different. XSL does have an <xsl:if> element, but there is no else. If you want to provide alternatives to a condition, then you have to enclose the entire conditional part in <xsl:choose> and use <xsl:when> in place of if and elseif, with <xsl:otherwise> taking the role of a more usual else clause. We’ve just used it to change the background of any event that lasts for entire days to be red; obviously, if we were laying out something like a Google Calendar entry, we would indicate that this was a full-day event in a nicer way, but the principle holds.

In most programming languages, you would probably just write the <h2...> tag here, and leave off inserting the actual EventName and the closing tag until outside the condition so that you do not duplicate code. However, an XSL transform is XML and XML does not allow tags to overlap. Thus, the <h2…> tag must be closed before we can close the <xsl:when> or <xsl:otherwise> tags.

If the code in each conditional part were more complex, you could avoid duplicating it by calling another template in XSLT 1.0 or using a function element in XSLT 2.0.

The remainder of the template uses value-of elements to place the start date and time within <h3> tags, followed by the duration as plain text. We also write the units for the Duration by referencing the units attribute; this is done by prefixing the attribute name with an @ character. So, Duration/units would refer to an element called units that was contained within a Duration, whereas Duration/@units refers to an attribute of the Duration element.
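The difference between the two paths can be checked outside a stylesheet, because the select expressions are ordinary XPath. The sketch below uses the JDK’s javax.xml.xpath package; the class name AttributeVersusElement and the one-event XML string are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class AttributeVersusElement {

    // Evaluates an XPath expression against the XML text and returns the result as a string.
    static String evaluate(String expression, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Event><Duration units=\"hours\">1.5</Duration></Event>";

        // With @ the path selects the attribute named units.
        System.out.println("[" + evaluate("/Event/Duration/@units", xml) + "]");

        // Without @ the path looks for a child element named units, which does not exist,
        // so the result is an empty string.
        System.out.println("[" + evaluate("/Event/Duration/units", xml) + "]");

        // The same concat() call the stylesheet uses to join the value and its units.
        System.out.println("[" + evaluate("concat(/Event/Duration, ' ', /Event/Duration/@units)", xml) + "]");
    }
}
```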

Finally, we close all tags that were opened to keep the document well-formed.

XSLT Processing

In fact, XSL transforms can manipulate the data in far more complex ways, including performing arithmetic on the elements, as the following XML and associated XSL transform demonstrate. Save the XML as invoice_items.xml, and save the transform in the same directory with the name invoice_items.xsl. Incidentally, the reason for using the same directory is down to Firefox’s way of mitigating the potential security issue we mentioned earlier. Both Opera and Firefox ensure that local files cannot access pages on the web, but Firefox goes a step further and also restricts them to only accessing files in the same folder or a subdirectory. You can find more information on this here.

Invoice items XML file


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="invoice_items.xsl"?>
<Invoice>
   <Number>A151012-001</Number>
   <Date>2015-10-12</Date>
   <Item>
       <PartNo>DF1845-G</PartNo>
       <Description>Wifi router</Description>
       <Quantity>2</Quantity>
       <UnitPrice>39.95</UnitPrice>
   </Item>
   <Item>
       <PartNo>AS6391-B</PartNo>
       <Description>Computer chassis - black</Description>
       <Quantity>1</Quantity>
       <UnitPrice>55.00</UnitPrice>
   </Item>
</Invoice>

Invoice items transform file


<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:template match="/Invoice">
       <html>
           <body>
               <h1>Invoice</h1>
               <table border="1">
                   <tr><td>Date</td><td><xsl:value-of select="Date"/></td></tr>
                   <tr><td>Invoice number</td><td><xsl:value-of select="Number"/></td></tr>
               </table>

               <p>
                   Terms strictly 30 days. Thank you for your order
               </p>

               <table border="1" style="border-collapse:collapse">
                    <tr>
                        <th>Part No</th>
                        <th>Description</th>
                        <th>Quantity</th>
                        <th>Price</th>
                        <th>Total</th>
                    </tr>
                   <xsl:for-each select="Item">
                       <tr>
                           <td><xsl:value-of select="PartNo"/></td>
                           <td><xsl:value-of select="Description"/></td>
                           <td><xsl:value-of select="Quantity"/></td>
                           <td><xsl:value-of select="UnitPrice"/></td>
                           <td><xsl:value-of select="Quantity * UnitPrice"/></td>
                       </tr>
                   </xsl:for-each>
               </table>

               <p>
                   Your order consists of <xsl:value-of select="sum(Item/Quantity)"/> items.
               </p>
           </body>
       </html>
   </xsl:template>
</xsl:stylesheet>

Opening the XML in a browser should produce the following:

[Image: the invoice rendered in a browser as two HTML tables]

We’ve already discussed most of the xsl elements in the Calendar example. Of interest here is the fact that we can perform basic arithmetic: multiplying Quantity by UnitPrice to get the total cost of each item line.

We also have access to a range of XPath functions which we can use in the value-of elements; here we have used sum() to calculate the total number of items ordered, and in the Calendar example we used concat() to ensure that a space appeared between the duration and its units.

Calculating a grand total for the whole invoice is also possible, but in XSLT 1.0 it is not trivial: it involves either extension elements or recursion, and is beyond the scope of this introduction.
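If the total is needed in a program rather than in the rendered page, one way around the problem is to leave XSLT out of it and do the arithmetic in Java. The sketch below does exactly that, so it is a swapped-in technique and not the XSLT solution alluded to above; the class name InvoiceTotals is an illustrative assumption, and the XML string is a cut-down copy of the invoice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InvoiceTotals {

    // Adds up Quantity * UnitPrice for every Item; BigDecimal keeps the currency arithmetic exact.
    static BigDecimal grandTotal(String invoiceXml) {
        try {
            Document invoice = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(invoiceXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate("/Invoice/Item", invoice, XPathConstants.NODESET);

            BigDecimal total = BigDecimal.ZERO;
            for (int i = 0; i < items.getLength(); i++) {
                // Relative paths are evaluated with the current Item as the context node.
                BigDecimal quantity = new BigDecimal(xpath.evaluate("Quantity", items.item(i)));
                BigDecimal unitPrice = new BigDecimal(xpath.evaluate("UnitPrice", items.item(i)));
                total = total.add(quantity.multiply(unitPrice));
            }
            return total;
        } catch (ParserConfigurationException | SAXException | IOException | XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Invoice>"
                + "<Item><Quantity>2</Quantity><UnitPrice>39.95</UnitPrice></Item>"
                + "<Item><Quantity>1</Quantity><UnitPrice>55.00</UnitPrice></Item>"
                + "</Invoice>";
        System.out.println("Grand total: " + grandTotal(xml));
    }
}
```

For the two items in the sample invoice this works out as 2 × 39.95 + 1 × 55.00 = 134.90.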

Parsing XML

Now that we have seen how XML is very good at representing structured data in a form that allows us access to the content, and also allows presentation of the data by means of XSL transforms or stylesheets, it’s time to look at how we can interrogate and manipulate an XML dataset programmatically.

For these examples, we’ll use both Python and Java. If you want to learn more about either of these languages, please check out my Python and Java courses on Udemy.

The examples were created using IntelliJ IDEA, which is an excellent IDE for Java and Python (and many other languages). It also makes a very impressive XML editor, supporting stylesheets as well.

SAX and DOM Parsers

We’ll start with an example using the W3C DOM (Document Object Model) before having a look at parsing XML using SAX (Simple API for XML).

These two approaches to parsing XML are very different. If you are familiar with the concept of a DOM in HTML, then it works pretty much the same with XML. The entire XML document is loaded into memory in a tree structure representing the document.

A SAX parser, on the other hand, reads the document as it is parsing it but does not actually store any data. Callback methods are invoked when significant events occur during the parsing (for example, if the parser sees a new tag, then it will trigger an event and call a callback method if you have defined one). It is up to your code to store any values and build up any relationship that you need among the elements that are being parsed.

This will all become a lot clearer in practice.

Parsing XML in Java

Parsing in Java using DOM

We’ll use our Calendar XML file as the source document, but the builder.parse method can also be given a URI or an Input stream as the source of its XML.

We start by creating a builder factory that is needed to create a builder object. It is this Document Builder that parses the XML and returns a DOM document.

If we are able to get a builder object, we pass our XML to its parse() method. As mentioned earlier, the source of the XML can be a file on the local file system, a URI, or an input stream (from an HttpURLConnection, for example).

If the parse method returns successfully, we now have a DOM document that we can interrogate or manipulate as we choose.
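As a small illustration of the input stream option, the sketch below parses XML held in memory. The ByteArrayInputStream is only a stand-in for a stream obtained from a network connection or a file, and the class name ParseFromStream and the helper rootName are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class ParseFromStream {

    // Parses whatever the stream delivers and returns the name of the root element.
    static String rootName(InputStream in) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            return document.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An in-memory stream stands in for a network or file stream here.
        byte[] bytes = "<Calendar><Event/></Calendar>".getBytes(StandardCharsets.UTF_8);
        System.out.println(rootName(new ByteArrayInputStream(bytes)));
    }
}
```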

In the example, the error handling is far from comprehensive. The document factory’s newDocumentBuilder method and the builder’s parse method both throw exceptions that must be handled, but in these examples we have done no more than print the stack trace.

Once we have our DOM in memory, we can loop through the elements, return the values and check the attributes of any element. The code is fairly straightforward and uses getElementsByTagName to return a NodeList containing all the Event elements from the tree. We then iterate through this list and print out the child elements of each Event, once again retrieving each child element by its name (EventName, StartDate, etc).

public class Main {

   public static void main(String[] args) {
       DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
       DocumentBuilder builder = null;
       try {
           builder = factory.newDocumentBuilder();
       } catch (ParserConfigurationException pe) {
           pe.printStackTrace();
       }

       Document document = null;
       if (builder != null) {
           try {
               document = builder.parse(new File("SampleXML.xml"));
           } catch (SAXException sae) {
               sae.printStackTrace();
           } catch (IOException e) {
               e.printStackTrace();
           }
       }

       if (document != null) {
           Element root = document.getDocumentElement();
           System.out.println(root.getNodeName());

           NodeList elements = document.getElementsByTagName("Event");

           for (int i = 0; i < elements.getLength(); i++) {
               Element element = (Element) elements.item(i);
               System.out.println("Event name:\t" +
                       element.getElementsByTagName("EventName").item(0).getTextContent());
               System.out.println("\tStart date:\t" +
                       element.getElementsByTagName("StartDate").item(0).getTextContent());
               System.out.println("\tStart time:\t" +
                       element.getElementsByTagName("StartTime").item(0).getTextContent());

               Node duration = element.getElementsByTagName("Duration").item(0);
               System.out.println("\tDuration:\t" + duration.getTextContent());
               System.out.println("\t\t" + duration.getAttributes().item(0));
           }
       }
   }
}

If you are using an editor that does not automatically handle imports for you, you’ll need the following:

import org.w3c.dom.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.IOException;

Before running the program, remember to set the full path and filename of your XML file in the call to parse: document = builder.parse(new File("SampleXML.xml"));

The DOM Tree is shown in the diagram below:

[Image: diagram of the DOM tree for the Calendar document]

There can be many Event elements, but to keep the diagram simple, only the detail for the first one has been shown.

Our code relies on the fact that we know the structure of our XML. Thus, when printing the elements of an Event, we referred to .item(0) rather than iterating through the list of items; we could do this because we know that an Event only has one of each child element (EventName, StartDate, StartTime and Duration).

Because we know the structure of our Calendar document, this approach works fine, but we may have to deal with XML whose structure is unknown. In that case, you would not be able to specify the element names and could not use getElementsByTagName.

However, the Node interface (which Element extends) supports methods such as hasChildNodes() and hasAttributes(), so we could write a more general-purpose routine that prints what it can about the current element before moving on to any child elements.

We’ll create a method parseChildren to display all of an element’s child nodes without having to know what they are called, or even how many there are.

In order to do this, we will need to make recursive calls to parseChildren from within parseChildren itself. If you haven’t used recursion before, it is a very powerful tool for dealing with structures that can be defined recursively and although it seems complicated, it’s really quite straightforward once you understand how it works.

When you call a method from within another method, the current state of the calling method is saved onto the call stack; Java does this by storing what’s called a stack frame, which contains all the method’s local variables and their values. When the called method returns, the stack frame for the method to be returned to (the calling method) is retrieved from the call stack and execution continues.

A recursive call behaves in exactly the same way, and once it has recursed all the way down the tree, the calls return and the previous state of the method is restored from the stack, effectively moving back up the tree.

The parseChildren() method takes a NodeList as a parameter, and as we can see from the DOM Tree diagram above, a NodeList can contain other NodeLists. Our Calendar NodeList contains Event elements, and each Event element contains a NodeList consisting of the child elements of an Event. When parseChildren reaches an element that itself has children, parseChildren is called again with a NodeList containing those child nodes. If any of those nodes also have children, the method will call itself again, and so on. When there are no children, the method returns to the parent and so on, back up the stack.

One thing to be careful of with a recursive method is overflowing the call stack. If the depth of recursion (i.e., the number of levels of children that themselves have children) is too great, then there will not be enough memory allocated for the call stack and the program will fail with a StackOverflowError.

Looking at our tree, if we start with the Calendar element, then the maximum depth of recursion (including Calendar) is 4, so overflowing the stack is not going to be a problem. The depth of recursion is not determined by how many elements are in the DOM tree, but by how deeply the elements are nested, and it would be a very unusual XML document that nested elements hundreds deep.
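The depth figure can be checked with a small recursive method of its own. This is a sketch under stated assumptions: the class name NestingDepth and its helpers are illustrative, and the XML string is a one-event version of the calendar without line breaks or indenting.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NestingDepth {

    // Returns how many levels the tree has from this node down to its deepest descendant.
    static int depthOf(Node node) {
        int deepestChild = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // The recursive call: each child reports the depth of its own subtree.
            deepestChild = Math.max(deepestChild, depthOf(children.item(i)));
        }
        return 1 + deepestChild; // this node plus the deepest branch below it
    }

    // Parses the XML text and measures the depth starting at the root element.
    static int depthOfRoot(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return depthOf(document.getDocumentElement());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Calendar><Event><EventName>Dentist</EventName></Event></Calendar>";
        System.out.println(depthOfRoot(xml));
    }
}
```

The four levels counted here are Calendar, Event, EventName and the text node holding the event name, and adding more Event elements would not change the result.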

To make it easier to see what’s happening, parseChildren increases a variable called depth each time it is invoked, and decrements it on exit. This allows us to build up a string of tabs to indent the output, which will give an idea of the level of recursion and, hopefully, make the output easier to read.

Add the parseChildren method below, then modify the main method to call it instead of using the for loop.


private static void parseChildren(NodeList elements) {
       depth++;
       String tab = "";
       for (int i = 0; i < depth; i++) {
           tab = tab + "\t";
       }
       for (int i = 0; i < elements.getLength(); i++) {
           Node node = elements.item(i);
           System.out.println(tab + "Node Name = " + node.getNodeName() + "; TextContent = " +
                                            node.getTextContent());

           if (node.hasAttributes()) {
               NamedNodeMap attributes = node.getAttributes();
               for (int n = 0; n < attributes.getLength(); n++) {
                   Node attr = attributes.item(n);
                   System.out.println(tab + "Attr name : " + attr.getNodeName() + "; Value = " +
                                                    attr.getNodeValue());
               }
           }

           if (node.hasChildNodes()) {
               parseChildren(node.getChildNodes());
           }
       }
       depth--;
   }

main() should be changed to look like this:


public class Main {
   public static int depth = 0;

   public static void main(String[] args) {
       DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
       DocumentBuilder builder = null;
       try {
           builder = factory.newDocumentBuilder();
       } catch (ParserConfigurationException pe) {
           pe.printStackTrace();
       }

       Document document = null;
       if (builder != null) {
           try {
               document = builder.parse(new File("SampleXML.xml"));
           } catch (SAXException sae) {
               sae.printStackTrace();
           } catch (IOException e) {
               e.printStackTrace();
           }
       }

       if (document != null) {
           Element root = document.getDocumentElement();
           System.out.println(root.getNodeName());

           NodeList elements = document.getElementsByTagName(root.getNodeName());
           parseChildren(elements);
       }
   }

   // The parseChildren method shown above goes here.
}

Notice that we are no longer passing the name of our Event elements to getElementsByTagName when creating the first NodeList; instead, we use the name of the root – which, as we are assuming no knowledge of this XML, we obtain using root.getNodeName().

The only changes we have made to main() are to add a static int depth to keep track of our depth of recursion so that we can indent the output; we’ve also changed the parameter to getElementsByTagName() so that we are passing the root of the DOM:

NodeList elements = document.getElementsByTagName(root.getNodeName());

…and we also replaced the for loop with a call to parseChildren.

One oddity of this DOM is that an element contains a whitespace-only Text node (called #text) before and after each of its child elements, which is why we see extra lines of the form:

Node Name = #text; TextContent =

with the content being a blank line. This is caused by our having put line breaks and indenting in the source XML. We could either reformat the XML so that it does not contain formatting, or ignore nodes whose type is TEXT_NODE (see later) and that contain nothing but white space.
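The second option, skipping whitespace-only text nodes, might be sketched as follows. The class name SkipWhitespace and its helpers are illustrative assumptions and not part of the parseChildren code above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SkipWhitespace {

    // True for text nodes that hold nothing but spaces, tabs and line breaks.
    static boolean isBlankText(Node node) {
        return node.getNodeType() == Node.TEXT_NODE && node.getNodeValue().trim().isEmpty();
    }

    // Lists the names of the root element's children, optionally leaving out blank text nodes.
    static List<String> childNames(String xml, boolean skipBlankText) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = document.getDocumentElement().getChildNodes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (skipBlankText && isBlankText(child)) {
                    continue; // formatting only, nothing to report
                }
                names.add(child.getNodeName());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Event>\n   <EventName>Dentist</EventName>\n"
                + "   <StartDate>2015-10-12</StartDate>\n</Event>";
        System.out.println(childNames(xml, false)); // includes the #text nodes
        System.out.println(childNames(xml, true));  // elements only
    }
}
```

The same isBlankText test could be placed at the top of the loop in parseChildren to suppress the extra lines.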

Note that if a node has child elements, getTextContent() includes the values of the child elements as well as any text node that it may contain. This makes the output look a little odd, but we can change that by replacing getTextContent() with getNodeValue(). Now the output shows null for the elements and displays the text when it processes the #text child nodes.

System.out.println(tab + "Node Name = " + node.getNodeName() + "; Value = " + node.getNodeValue());
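The difference between the two methods is easiest to see on a tiny document. In this sketch the class name TextContentVersusValue, the helper parseRoot and the XML string are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TextContentVersusValue {

    // Parses the XML text and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element event = parseRoot(
                "<Event><EventName>Dentist</EventName><StartDate>2015-10-12</StartDate></Event>");

        // getTextContent() on an element concatenates the text of all its descendants.
        System.out.println(event.getTextContent());

        // getNodeValue() on an element is always null; elements have no value of their own.
        System.out.println(event.getNodeValue());

        // The text lives in a #text child node, where getNodeValue() returns it.
        Node text = event.getFirstChild().getFirstChild();
        System.out.println(text.getNodeName() + " = " + text.getNodeValue());
    }
}
```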

Once you get the hang of the Document Object Model, processing one is not too complicated even if you have no knowledge of the contents of the source XML.

If you want to experiment further, the node objects have other interesting methods, and it can be useful to print out node.getParentNode().getNodeName() to also show the parent.

There are also methods getFirstChild() and getLastChild() to obtain the first and last child nodes, and getNextSibling() and getPreviousSibling() to allow navigation across the tree rather than down it. We have seen that the NodeList objects have a getLength() method to return how many nodes they contain, so counting the number of child nodes is also easy.
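Walking across the tree with these methods could look like the sketch below, which visits the children of an element by following sibling links instead of indexing into a NodeList. The class name SiblingWalk and its helpers are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SiblingWalk {

    // Collects the names of the child elements by stepping from sibling to sibling.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    // Parses the XML text and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element event = parseRoot("<Event><EventName>Dentist</EventName>"
                + "<StartDate>2015-10-12</StartDate><StartTime>09:30:00</StartTime></Event>");

        System.out.println(childElementNames(event));
        System.out.println("Last child: " + event.getLastChild().getNodeName());
        System.out.println("Parent of first child: " + event.getFirstChild().getParentNode().getNodeName());
        System.out.println("Number of children: " + event.getChildNodes().getLength());
    }
}
```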

It may be useful to know what type of node has been found, and the Node object has a getNodeType() method that can return one of the following values (taken from the Node.java file):

public static final short ELEMENT_NODE = 1;
public static final short ATTRIBUTE_NODE = 2;
public static final short TEXT_NODE = 3;
public static final short CDATA_SECTION_NODE = 4;
public static final short ENTITY_REFERENCE_NODE = 5;
public static final short ENTITY_NODE = 6;
public static final short PROCESSING_INSTRUCTION_NODE = 7;
public static final short COMMENT_NODE = 8;
public static final short DOCUMENT_NODE = 9;
public static final short DOCUMENT_TYPE_NODE = 10;
public static final short DOCUMENT_FRAGMENT_NODE = 11;
public static final short NOTATION_NODE = 12;

The first four types are probably the ones of most interest at this stage, and we have seen examples of the first three. CDATA is similar to text and is used, for example, if the text contains content that could be interpreted as XML even though it isn’t. The actual definition is “a section of element content that is marked for the parser to interpret as only character data, not markup.”
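To see the node types in practice, including a CDATA section, the sketch below describes each child of a small element. The Note and Ref elements are invented for this illustration and do not appear in the Calendar file; the class name NodeTypes is likewise an assumption. It relies on the factory’s default setting, under which CDATA sections are kept as separate nodes and not merged into neighbouring text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeTypes {

    // Produces a short description of a node based on its type constant.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                return "element " + node.getNodeName();
            case Node.TEXT_NODE:
                return "text [" + node.getNodeValue() + "]";
            case Node.CDATA_SECTION_NODE:
                return "CDATA [" + node.getNodeValue() + "]";
            default:
                return "other, type " + node.getNodeType();
        }
    }

    // Parses the XML text and describes each child of the root element.
    static List<String> describeChildren(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = document.getDocumentElement().getChildNodes();
            List<String> descriptions = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                descriptions.add(describe(children.item(i)));
            }
            return descriptions;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The <b> tags inside the CDATA section are treated as characters, not as markup.
        String xml = "<Note>See <![CDATA[<b>not markup</b>]]><Ref/></Note>";
        for (String description : describeChildren(xml)) {
            System.out.println(description);
        }
    }
}
```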

Now that we’ve seen how to interrogate XML using the DOM, we’ll have a look at a totally different approach, the event-driven approach of SAX.

Parsing in Java using SAX

Whereas the DOM parser loads the entire document into memory, SAX works by reading the XML and raising events when something interesting happens, such as a new tag or its end tag being found. This lets us respond to these events and examine the current element.

One advantage of this approach is that it is very fast and computer memory does not become an issue. Instead of being presented with an entire document, the only thing in memory is the current element.

A disadvantage is that it is up to our code to maintain any relationship between the elements if we need to. We’ll see an example of maintaining state when we ensure that we use the correct <name> element later.

To see an event-based parser in action, we’ll create a class that extends the default handler. This will provide all the methods that we want to be called when the various events are triggered. Create a new class, call it MyHandler, and paste in the following:


import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MyHandler extends DefaultHandler {

   private int depth = 0;
   private String tab;

   @Override
   public void startDocument() throws SAXException {
       System.out.println("startDocument called");
   }

   @Override
   public void endDocument() throws SAXException {
       System.out.println("endDocument called");
   }

   @Override
   public void startElement(String uri, String localName, String qName, Attributes attributes)
                            throws SAXException {
       depth++;
       tab = "";
       for (int i = 0; i < depth; i++) {tab = tab + "\t";}

       System.out.println(tab + "startElement called with " + qName);
       if (attributes != null) {
           for (int i = 0; i < attributes.getLength(); i++) {
               System.out.println(tab + "\tAttribute " + attributes.getQName(i) +
                                  " has value " +attributes.getValue(i));
           }
       }
   }

   @Override
   public void endElement(String uri, String localName, String qName) throws SAXException {
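       // Rebuild the indent for the current depth before printing, so that the end tag
       // lines up with its start tag, then step back up one level.
       tab = "";
       for (int i = 0; i < depth; i++) {tab = tab + "\t";}
       System.out.println(tab + "endElement called with " + qName);
       depth--;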
   }

   @Override
   public void characters(char[] ch, int start, int length) throws SAXException {
       System.out.println(tab + "characters called with " + new String(ch, start, length));
   }
}

As we will see, a default handler is passed to the parse method.  The methods we are overriding will be called by the parser when the corresponding event occurs – so endElement() will be called when the parser encounters an end tag in the XML document, for example.

Our Main class is quite simple, and very similar to the one we used with a DOM parser. We once again use a factory to create our parser, but this time it’s a SAXParserFactory object.

Next, we create an instance of the MyHandler class and pass it, together with the XML file name, to the parse() method.

We could have used an anonymous class instead of creating MyHandler, but for this example, it was clearer to discuss them separately.


import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;

public class Main {

   public static void main(String[] args) {
       SAXParserFactory factory = SAXParserFactory.newInstance();
       SAXParser parser = null;

       try {
           parser = factory.newSAXParser();
       } catch(ParserConfigurationException | SAXException e) {
           e.printStackTrace();
       }


       DefaultHandler defaultHandler = new MyHandler();

       try {
           if (parser != null) {
               parser.parse("SampleXML.xml", defaultHandler);
           }
       } catch(IOException | SAXException e) {
           e.printStackTrace();
       }
   }
}

If you get an error that Multi-catches are not supported, either make sure you are compiling to at least Java 7, or split the catches up.

When you run main(), you should see the various methods being called as the parser encounters the various tags and values in the XML.

Whenever one of our handler methods is called, the only things it is aware of are the parameters passed to it; there is no provision for navigating a tree as we could with the DOM parser.

Thus, if we want to process an element’s attributes, we have to do so when startElement() is invoked for that element; if we don’t, we will not get another chance without starting parsing from the beginning.

In the implementation of the various handler methods given here, we don’t do very much other than print a message identifying which method is being called together with its most useful parameter(s).

startElement() also checks if the attributes parameter is not null, and iterates through the collection printing each attribute’s name and value.

We maintain a tab depth in startElement() and endElement(), much as we did in the DOM example’s parseChildren, purely to make the output easier to read.

It’s worth mentioning the characters() method, as this is where we are able to pick up the contents – or value – of an element. There is no equivalent to the DOM’s getTextContent() or getNodeValue() methods; instead, our overridden characters() method is invoked when the text contents have been parsed. The parameters to characters() are a char array, a starting position, and a length. It’s no coincidence that these correspond to the parameters required by one of the constructors of the String class, making converting the input parameters into a string quite easy. Once again, remember that there is no going back and if our program needed to store the node names and values, then it would have to explicitly do so – perhaps adding objects to an ArrayList, for example.

Next, we’ll look at parsing a much larger dataset; a good one to introduce some complexity is the iTunes Store Top Songs feed at http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/topsongs/limit=10/xml.

It’s worth pointing a browser at that URL and then viewing the page source to get a feel for the XML that we will be downloading. The elements we will be capturing are “im:name” and “im:artist” within the “entry” elements. If you examine the XML carefully, though, you will see that there is also an im:name tag nested in an “im:collection” element for each entry. That would be easy to deal with using a DOM parser but requires a little more thought with our SAX parser.

We’re going to simplify the MyHandler class by stripping out all the diagnostic output and tab counting, etc., and we’ll also remove the unused variables and methods before adding some functionality to allow parsing of the top “n” songs on iTunes. If you get problems, it may be worth putting the diagnostics back into the methods we are using.

As we have to store the values we are interested in, we’ll create a TopSong class to hold the song title and artist name and store all the songs in an ArrayList.

The TopSong class is simple and looks like this:

public class TopSong {
   private String title = null;
   private String artist = null;

   public TopSong() {
   }

   public void setTitle(String title) {
       if (this.title == null) {
           this.title = title;
       } else {
           this.title = this.title + title;
       }
   }

   public void setArtist(String artist) {
       this.artist = artist;
   }

   @Override
   public String toString() {
       return this.title + " by " + this.artist;
   }
}

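The setter for the title field is a little unusual: it appends to any existing title rather than replacing it. The reason is that a SAX parser is allowed to deliver an element’s text through several calls to characters(), and with this parser, titles that contain apostrophes arrive split at the apostrophe. It’s a minor inconvenience, and we just keep appending the pieces as we find them. (An artist name could in principle be split the same way, in which case setArtist() as written would keep only the last piece.)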

The modified MyHandler class contains a few variables to store the songs and also to track state as we parse.

We need to know which element we are currently dealing with, and most of the relevant methods have qName (qualified name) as a parameter. Unfortunately, characters() does not. So unless we keep track of the element we are in when characters() is called, we will have no idea what to do with the text. We use the String variable currentElement to track this, and make sure to set it to an empty string whenever endElement() is called.

Bearing in mind the (normal) requirement to track what element is being parsed, and the (slightly unusual) one of watching out for a duplicate element name within the collection element, startElement is pretty simple. Store the element name; if this is a new entry element, then create a new TopSong instance to hold the song details, and if we are entering an im:collection element, set a flag so we can ignore any data in it.

The endElement() method is also simple. If we are leaving an entry element, then we have collected all the data we need and can add our topSong object to the ArrayList. If the element is an im:collection, then clear our flag so we once again start processing im:name elements. Finish by clearing the currentElement variable, as we no longer have a current element.

The characters() method checks if the text relates to either an im:name or im:artist element, and sets the corresponding field of a topSong object if there is an object. However, it only does this if we are not within an im:collection element – thus ignoring the spurious im:name elements contained in the collection.

Once everything has been parsed, the endDocument() method is called, so this is an ideal place to print out (or otherwise process) the songs that we have stored. Obviously, the exact nature of the processing will vary depending upon the application; we could, if appropriate, implement a callback interface and notify the MyHandler object’s parent that processing is complete, passing it the data that we have collected.

Here’s the MyHandler class:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;

public class MyHandler extends DefaultHandler {

   private TopSong topSong = null;
   private ArrayList<TopSong> topSongs = new ArrayList<>();
   private String currentElement = null;
   private boolean inCollection = false;

   @Override
   public void endDocument() throws SAXException {
       System.out.println("iTunes Store top songs");
       System.out.println("-------------------------");
       for (TopSong song : topSongs) {
           System.out.println(song.toString());
       }
   }

   @Override
   public void startElement(String uri, String localName, String qName, Attributes attributes)
                            throws SAXException {

       currentElement = qName;
       if (qName.equals("entry")) {
           // Make sure we have a topSong object available.
           topSong = new TopSong();
       } else if (qName.equals("im:collection")) {
           inCollection = true;
       }
   }

   @Override
   public void endElement(String uri, String localName, String qName) throws SAXException {
       if ((qName.equals("entry")) && (topSong != null)) {
           topSongs.add(topSong);
           topSong = null; // clear our object ready to instantiate another one if necessary.
       } else if (qName.equals("im:collection")) {
           inCollection = false;
       }

       currentElement = "";
   }

   @Override
   public void characters(char[] ch, int start, int length) throws SAXException {
       String textContent = new String(ch, start, length);

       if ((topSong != null) && (!inCollection)) {
           if (currentElement.equals("im:name")) {
               topSong.setTitle(textContent);
           } else if (currentElement.equals("im:artist")) {
               topSong.setArtist(textContent);
           }
       }
   }
}

The change to Main is trivial; we just pass to the parse method the iTunes URL instead of a file name:

parser.parse("http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/topsongs/limit=10/xml", defaultHandler);

Once you are satisfied that it works, change the limit in the URL from 10 to 100. Everything should still work fine. To save you time trying to find the upper limit allowed by Apple, it’s 200; any value greater than that will give an error.

Parsing XML in Python

The standard Python libraries contain a number of XML parsers; the one we will use here is ElementTree. From Python 3.3, it automatically uses the cElementTree implementation (which is written in C rather than Python and is significantly faster) but will fall back to the Python implementation if running on a platform that won’t support the C accelerator.

Parsing in Python using DOM

The Python equivalent of our first Java example to read the Calendar XML will look something like this:

import xml.etree.ElementTree as ElementTree

tree = ElementTree.parse("SampleXML.xml")
root = tree.getroot()
print(root.tag)

for node in root:
   print("\tNode Name: " + node.tag)
   for subnode in node:
       print("\t\tNode Name = {}; TextContent = {}".format(subnode.tag, subnode.text))
       if subnode.attrib:
           print("\t\tAttr Name : units; Value = ", subnode.attrib['units'])

Or, in Python 2:

try:
   import xml.etree.cElementTree as ElementTree
except ImportError:
   import xml.etree.ElementTree as ElementTree

tree = ElementTree.parse("SampleXML2.xml")
root = tree.getroot()
print(root.tag)

for node in root:
   print("\tNode Name: " + node.tag)
   for subnode in node:
       print("\t\tNode Name = %s; TextContent = %s" % (subnode.tag, subnode.text))
       if subnode.attrib:
           print("\t\tAttr Name : units; Value = %s" % (subnode.attrib['units']))

Note that we are making sure the cElementTree implementation is used when it is available, as this is not done automatically with versions before Python 3.3.

All subsequent examples will use Python 3, but should be easy to convert if you prefer Python 2.

Once again, we are relying on knowledge of the structure of the Calendar XML, so we don’t have to go further down than the subnodes.

An element’s attributes are returned in a standard Python dictionary, so we just provide the key “units” to extract the value.
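As a small illustration of that dictionary behaviour, here is a sketch that parses an inline document instead of SampleXML.xml. The element names and the units attribute are assumed to match the Calendar example, and the get() call shows one way to avoid a KeyError when an element has no such attribute.

```python
import xml.etree.ElementTree as ElementTree

# Inline stand-in for the Calendar file; its structure is an assumption.
SAMPLE = """<Calendar>
    <Event>
        <EventName>Dentist</EventName>
        <Duration units="hours">1.5</Duration>
    </Event>
</Calendar>"""

root = ElementTree.fromstring(SAMPLE)
duration = root.find("Event/Duration")
name = root.find("Event/EventName")

# attrib is an ordinary dict, so both indexing and get() work.
units = duration.attrib["units"]
missing = name.attrib.get("units", "not specified")

print(units)
print(missing)
```

Indexing is fine when we know the attribute is there; get() with a default is the safer choice when we do not.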

ElementTree provides a tree iterator which iterates over the entire tree, starting at the root. The second Java example to print the elements when we have no knowledge of the structure is therefore very easy in Python:

import xml.etree.ElementTree as ElementTree

tree = ElementTree.parse("SampleXML2.xml")

for node in tree.iter():
   print(node.tag, node.text)
   if node.attrib:
       for name, value in node.attrib.items():
           print("\tAttr Name : {}; Value = {}".format(name, value))
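The same iterator can also filter by tag name, which is handy when only one kind of element is of interest. The sketch below uses a small inline document (an assumption, standing in for the Calendar file) and collects just the EventName values.

```python
import xml.etree.ElementTree as ElementTree

# Inline stand-in for the Calendar file; its structure is an assumption.
SAMPLE = """<Calendar>
    <Event><EventName>Dentist</EventName><StartDate>2015-10-12</StartDate></Event>
    <Event><EventName>Carnival</EventName><StartDate>2015-10-17</StartDate></Event>
</Calendar>"""

root = ElementTree.fromstring(SAMPLE)

# iter() with a tag name walks the whole tree but yields only matching elements.
event_names = [node.text for node in root.iter("EventName")]
print(event_names)
```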

Parsing in Python using SAX

ElementTree includes an iterparse() function that behaves very similarly to a SAX parser; it’s more of a hybrid, but it is fast and very easy to use. Here’s the SAX example in Python using ElementTree:

import xml.etree.ElementTree as ElementTree

depth = 0

for event, element in ElementTree.iterparse("SampleXML.xml", events=("start", "end")):
   if event == "start":
       depth += 1
       tabs = '\t' * depth
       print(tabs + "started " + element.tag)
       for name, value in element.attrib.items():
           print(tabs + "\tAttr Name : {}; Value = {}".format(name, value))
   else:
       # Recompute the indent so an end tag lines up with its own start tag.
       tabs = '\t' * depth
       print(tabs + "\t{} Value: {}".format(element.tag, element.text))
       print(tabs + "ended " + element.tag)
       depth -= 1
       element.clear()

When calling iterparse() we provide the source and a sequence of events. We are using “start” and “end”; “start-ns” and “end-ns” are also available if you need detailed namespace information, and Python 3.8 added “comment” and “pi” events.
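To give a feel for the namespace events, here is a minimal sketch that collects the prefix-to-URI mappings declared in a document. The inline feed is made up for illustration; it only borrows the two namespace URIs used by the iTunes feed.

```python
import io
import xml.etree.ElementTree as ElementTree

# Cut-down, made-up feed using the same two namespaces as the iTunes feed.
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:im="http://itunes.apple.com/rss">
    <entry><im:name>Example Song</im:name></entry>
</feed>"""

namespaces = {}
for event, item in ElementTree.iterparse(io.BytesIO(SAMPLE), events=("start-ns",)):
    # For "start-ns" the second value is a (prefix, uri) tuple, not an element.
    prefix, uri = item
    namespaces[prefix] = uri

print(namespaces)
```

The default namespace is reported with an empty prefix. ElementTree does not keep prefixes in tag names; it expands them, so the im:name element above is reported with the tag {http://itunes.apple.com/rss}name.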

iterparse() returns an iterator providing tuples of events and elements, so we can check to see which event we have and respond accordingly.

A start event may be emitted as soon as the final “>” of the starting tag has been seen, so there’s no guarantee that the contents will be available; that’s why we don’t attempt to display the contents until we get the “end” event. As we must have the attributes by the time the start tag’s “>” has been processed, we can display the attributes in “start”.

If you don’t specify the events, the default is that only “end” events are returned.
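A quick way to see that default is to call iterparse() with no events argument and record what comes back. The inline document here is an assumption used only to keep the sketch self-contained.

```python
import io
import xml.etree.ElementTree as ElementTree

SAMPLE = b"<Calendar><Event><EventName>Dentist</EventName></Event></Calendar>"

# No events argument: only "end" events are reported, innermost element first.
seen = [(event, element.tag)
        for event, element in ElementTree.iterparse(io.BytesIO(SAMPLE))]
print(seen)
```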

Finally, when we get an “end” event, the element is cleared. The ElementTree iterparse() function is a hybrid: it still builds the element tree in memory as it parses. To gain the memory benefit of a SAX-type parse, we clear each element when we are done processing it.
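The sketch below shows one common variation on that idea: rather than clearing every element, it waits for the end of each Event, copies out the value it needs and only then clears the element. The inline document and the choice of Event as the unit of work are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ElementTree

SAMPLE = b"""<Calendar>
    <Event><EventName>Dentist</EventName></Event>
    <Event><EventName>Carnival</EventName></Event>
</Calendar>"""

names = []
root = None
for event, element in ElementTree.iterparse(io.BytesIO(SAMPLE), events=("start", "end")):
    if event == "start":
        if root is None:
            root = element  # the first start event belongs to the root
    elif element.tag == "Event":
        # Copy out what we need, then drop the element's text and children.
        names.append(element.findtext("EventName"))
        element.clear()

print(names)
print(len(root), len(root[0]))
```

Note that a cleared element stays attached to its parent as an empty shell, so the root still has two (empty) Event children at the end. That is rarely a problem, but it is worth knowing for very large documents.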

The last Java example was parsing the song title and artist from the iTunes top songs feed, which is quite simple now that we have seen how iterparse() works. To keep things simple – and because it’s such a basic class – the TopSong class is defined in the same Python file.

import urllib.request
import xml.etree.ElementTree as ElementTree


class TopSong(object):

   def __init__(self):
       # Start with empty strings so to_string() works before the setters run.
       self.title = ''
       self.artist = ''

   def set_title(self, title):
       self.title = title

   def set_artist(self, artist):
       self.artist = artist

   def to_string(self):
       return self.title + " by " + self.artist

url = "http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/topsongs/limit=10/xml"
inCollection = False
topSongs = []

xml = urllib.request.urlopen(url)
for event, element in ElementTree.iterparse(xml, events=("start", "end")):
   if event == "start":
       if element.tag == "{http://itunes.apple.com/rss}collection":
           inCollection = True
       elif element.tag == "{http://www.w3.org/2005/Atom}entry":
           currentSong = TopSong()
   else:
       if element.tag == "{http://itunes.apple.com/rss}collection":
           inCollection = False
       elif element.tag == "{http://itunes.apple.com/rss}artist":
           currentSong.set_artist(element.text)
       elif element.tag == "{http://itunes.apple.com/rss}name" and not inCollection:
           currentSong.set_title(element.text)
       elif element.tag == "{http://www.w3.org/2005/Atom}entry":
           topSongs.append(currentSong)
       element.clear()

for song in topSongs:
   print(song.to_string())

The output from this should match what we got from the Java code above. Again, once it is working, you can increase the limit in the URL; 200 is the maximum.

The actual download is performed by importing urllib.request and using the urlopen function, although iterparse will accept a file object as the source for its XML so you could perform the download in some other way.

Just as with the Java implementation, we have to track the collection element and ensure that we do not collect the name from there; this is done by setting inCollection to True when we see the collection start tag and to False when the end tag is seen.

A new TopSong instance is created when we enter an “entry” tag, and appended to the list when the entry element’s end tag appears.

Songs are stored in a list, similarly to the approach taken with the Java code.

Because there is no end document event, we print the list once we have finished iterating through all the events returned by iterparse().
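For comparison, Python's standard library also includes a true SAX interface in the xml.sax module, whose ContentHandler plays the same role as the Java DefaultHandler and does provide endDocument(). The following is a sketch only: the TopSongHandler class, its field names and the inline feed are all made up for illustration, and the feed is a cut-down stand-in for the real iTunes data.

```python
import xml.sax

# Made-up, cut-down feed; the real iTunes feed has many more elements.
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:im="http://itunes.apple.com/rss">
    <entry>
        <im:name>Don&apos;t Stop</im:name>
        <im:artist>Example Artist</im:artist>
        <im:collection><im:name>Example Album</im:name></im:collection>
    </entry>
</feed>"""


class TopSongHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.songs = []             # finished (title, artist) pairs
        self.current = None         # values for the entry being parsed
        self.element = ""           # name of the element we are inside
        self.in_collection = False
        self.finished = False

    def startElement(self, name, attrs):
        self.element = name
        if name == "entry":
            self.current = {"im:name": "", "im:artist": ""}
        elif name == "im:collection":
            self.in_collection = True

    def characters(self, content):
        # Text may arrive in several chunks, so always append.
        if (self.current is not None and not self.in_collection
                and self.element in self.current):
            self.current[self.element] += content

    def endElement(self, name):
        if name == "entry" and self.current is not None:
            self.songs.append((self.current["im:name"], self.current["im:artist"]))
            self.current = None
        elif name == "im:collection":
            self.in_collection = False
        self.element = ""

    def endDocument(self):
        self.finished = True


handler = TopSongHandler()
xml.sax.parseString(SAMPLE, handler)
for title, artist in handler.songs:
    print(title + " by " + artist)
```

The state tracking is the same as in the Java handler: remember the current element, ignore text inside im:collection, and append text rather than assign it, because the parser may split it.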

I’ve said that XML can be parsed without understanding the contents of the document, but in our calendar example, this is clearly not the case. The StartDate values could be interpreted as year, month, day, or as year, day, month, for example: although I know that the dental appointment starts on Monday, October 12th, 2015, it could be read as starting on December 10th. The W3C XML Schema defines a number of data types, including dates, times and durations.

To ensure error-free transfer of data using XML, it is possible to remove any ambiguity by specifying the type of the elements using a schema.

If we are going to use the W3C duration type, we have to specify our duration in the conformant form “PnYnMnDTnHnMnS”. The capital letters are literals, and are used as delimiters. We must start with P, and a T must precede any time components (which is how months and minutes, both written M, are told apart); all the others follow the value they are delimiting and can be omitted if the value they relate to is not used. Thus, one and a half hours would become “PT1H30M”, and a whole day would be “P1D”. We can also now remove the units from our durations as the duration type specifies all that is required; however, attributes that describe units can still be very useful if working in miles and transferring data to countries that work in kilometers, for instance.
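Python's standard library has no parser for this format, so as a rough sketch of how the delimiters work, the hypothetical parse_duration() helper below turns a duration string into a timedelta. It is deliberately limited: it handles only days, hours, minutes and seconds (years and months have no fixed length) and ignores negative durations.

```python
import re
from datetime import timedelta

# Hypothetical helper pattern: days, then an optional T followed by
# hours, minutes and seconds. The lookahead rejects a trailing bare "T".
_DURATION = re.compile(
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?=\d)(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?"
)


def parse_duration(text):
    """Convert a simple duration such as PT1H30M into a timedelta."""
    match = _DURATION.fullmatch(text)
    if match is None or not any(match.groupdict().values()):
        raise ValueError("not a supported duration: " + repr(text))
    parts = {key: float(value)
             for key, value in match.groupdict().items()
             if value is not None}
    return timedelta(**parts)


print(parse_duration("PT1H30M"))
print(parse_duration("P1D"))
```

The two calls use the same duration values that appear in the sample XML file.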

Save the following as SampleXML1.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Calendar xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:noNamespaceSchemaLocation="SampleXML1.xsd">
	<Event>
		<EventName>Dentist</EventName>
		<StartDate>2015-10-12</StartDate>
		<StartTime>09:30:00</StartTime>
		<Duration>PT1H30M</Duration>
	</Event>
	<Event>
		<EventName>Carnival</EventName>
		<StartDate>2015-10-17</StartDate>
		<StartTime>09:00:00</StartTime>
		<Duration>P1D</Duration>
	</Event>
</Calendar>

and the schema below in the same directory as SampleXML1.xsd:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">

<xs:element name="Calendar">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Event" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="EventName" type="xs:string"/>
            <xs:element name="StartDate" type="xs:date"/>
            <xs:element name="StartTime" type="xs:time"/>
            <xs:element name="Duration" type="xs:duration"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:schema>

By using a strict schema, this ambiguity is removed and the XML dataset becomes portable: any system that can read the schema knows how to interpret the values.

We haven’t covered XML validation, but tools exist to validate XML against a schema to ensure that it conforms, and this would normally be performed before attempting to process the dataset.

To see this in action, you can validate your XML and schema at http://www.utilities-online.info/xsdvalidation/. The XML and schema can be checked independently, and then the XML is validated against the schema. Try removing the century digits from the date, for example, and when you re-validate, you will get an error.
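The Python standard library does not include an XML Schema validator, so the following is not schema validation. It is only a rough sketch of the kind of check that typing StartDate and StartTime makes possible, using a hypothetical check_event() helper built on the datetime module's ISO parsers (Python 3.7 or later). These parsers are looser than xs:date and xs:time in some respects, so treat the result as indicative.

```python
from datetime import date, time


def check_event(start_date, start_time):
    """Return a list of problems; an empty list means both values parsed."""
    problems = []
    try:
        date.fromisoformat(start_date)
    except ValueError:
        problems.append("StartDate is not a valid ISO date: " + start_date)
    try:
        time.fromisoformat(start_time)
    except ValueError:
        problems.append("StartTime is not a valid ISO time: " + start_time)
    return problems


print(check_event("2015-10-12", "09:30:00"))
print(check_event("15-10-12", "09:30:00"))
```

The second call drops the century digits from the date, which is the same mistake suggested above, and it is reported as a problem.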


There is a lot more to XML than could be covered here. We haven’t touched on using code to generate the XML, for one thing, and XML validation is a much bigger subject than we have had time for. Hopefully, though, this tutorial will have given you a taste of how XML can be used for representing structured data without sacrificing the ability to present it in an attractive format, as well as showing how XML can be processed to retrieve meaning and content from XML data sets.