Last modified: Monday, 20-May-2019 01:31:03 UTC. Maintained by: Elisa E. Beshero-Bondar (eeb4 at psu.edu). Powered by firebellies.

Introduction to XML

So. . . What exactly is XML anyway?

XML: Structured, Informational Markup

XML is short for eXtensible Markup Language, and it’s a standard system for storing and accessing information that is used practically everywhere around the world. It’s the informational markup (or code) working behind the scenes in software like Microsoft Office and Blackboard, and it’s the foundation of many online network applications. For our purposes as researchers, it’s an excellent method for storing information, and for preparing to share it with the public. XML is independent of proprietary software applications, which means that what you write in XML is freely exchangeable between computers of different kinds (across platforms, as in Macs and PCs). It outlasts software obsolescence, because it’s a standard that can be read universally.

You’ve probably heard of HTML (hypertext markup language), which is the code that makes web pages presentable in web browsers. That’s a kind of Markup Language, too, designed specifically and only for presentation and publication on the world-wide web: HTML is for presentation and display, but XML is primarily for storage of information, and we can call it informational markup, as opposed to presentational markup. We can write code to take information written in XML and transform it for presentation online—and you’ll gain experience with doing that as we proceed with our class this semester.

XML is great for researchers in the Humanities and Social Sciences because it’s very effective at storing and cataloging information systematically. You can write XML to set up hierarchies (or nested structures) of information, and also to locate and extract that information later when you need it. So, if we were going to store a book in XML, we’d pay attention to the way it’s structured, maybe with chapters, and inside those chapters we’d have chapter titles and paragraphs, and inside those, sentences, and then individual words and punctuation. If we wanted to, we could systematically mark all the action verbs in those sentences, and all the exclamation points, using XML, if this was important, and we could design a hierarchical system using XML to capture and hold the information we want to collect.
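As a rough sketch of what that might look like, here is one tiny book marked up this way. The element names (book, chapter, chapterTitle, paragraph, sentence, actionVerb, exclamation) and the text itself are made up for illustration; a real project would choose its own.

```xml
<!-- Hypothetical element names and text, chosen only for this sketch. -->
<book>
  <chapter>
    <chapterTitle>The Storm</chapterTitle>
    <paragraph>
      <sentence>The wind <actionVerb>howled</actionVerb>.</sentence>
      <sentence><actionVerb>Run</actionVerb><exclamation>!</exclamation></sentence>
    </paragraph>
  </chapter>
</book>
```

Notice how each level sits completely inside the one above it: verbs inside sentences, sentences inside a paragraph, the paragraph inside a chapter, and the chapter inside the book.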

When we do research in the humanities, we’re working with documents written by human beings, and XML is useful for preserving them for reading and studying, and for extracting information from them later. We can do this close-up (through "close reading"), by reading documents with our eyes, one by one. Or we can code documents systematically (which also involves close reading), in order to step BACK and view them from a distance: to let a computer discover patterns we couldn’t so easily see on our own. In Digital Humanities, this practice of working with computers to make them show us patterns across enormous, complicated texts or many, many texts is called distant reading. XML helps us prepare texts for this, for two reasons:

XML is a formal model that represents an orderly structure—a hierarchy of information. To the extent that human documents are ordered in a systematic way, this can be represented and described using XML.
Computers work very quickly on orderly hierarchies of information. So if we model the documents we want to study as hierarchies in XML, we can use computers to count related things and help us find patterns.

We have to start by studying our documents to see how they’re structured, and identify what matters to us in describing a structure. This practice is called document analysis, and it’s basically what you’re doing when you have to make decisions about how to code a recipe, a voyage log, a poem, or a letter in our first XML exercises in this course.

XML is Nested Boxes, or a Tree

In technical terms, we can think of every XML document as a tree, sprouting from a single root, which contains and identifies the whole thing. That outermost layer is the start-tag and end-tag, the alpha and the omega of an XML file. I tend to think of this as a single box that contains everything else, with all its branching complexity inside.

<root>
   <shoppingList>
      <food>eggs</food>
      <food>cheese</food>
      <cloth>jeans</cloth>
   </shoppingList>
   <stores>
      <vendor>Grocery Store</vendor>
      <vendor>Clothing Store</vendor>
   </stores>
</root>

XML marks a structure, or the hierarchy of a document, by using elements, such as shoppingList and food. Each element consists of the following:

a start-tag
(whatever’s in between the tags)
an end-tag

[Figure: breakdown of element structure]

A start tag is defined with angle brackets, and an end tag looks like a start tag, except it has a forward slash after the opening angle bracket. When we refer to tags, we’re talking about those start and end tags. When we talk about an element, we’re referring to the whole thing: start-tag, CONTENT, and end-tag. Make sense?
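To see those pieces in one place, here is a single element taken from the shopping list, with its parts described in notes for human readers. The notes are not part of the element.

```xml
<!-- The whole line below is one element named food. -->
<!-- It opens with a start-tag, holds the text content eggs, and closes with an end-tag (note the forward slash). -->
<food>eggs</food>
```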

Elements may also include something called attributes: additional markup that gives supplementary information about an element. So, say we had ingredient names in French and Spanish in our shopping list, and we wanted to mark those. One option for doing this would be:

<foreign language="French">escargot</foreign>

<foreign language="Spanish">sofrito</foreign>

See how this works: Attributes are written inside the start-tag of an element (but NOT inside the end-tag). They consist of an attribute name and an attribute value. The attribute name, here, is language, and the attribute value is (you guessed it!) French or Spanish. (Attributes are sort of like adjectives, or descriptive modifiers!) Notice there’s a rule for HOW to write attributes: attribute values must be enclosed inside quotation marks. These can be either double straight quotation marks (") or single straight apostrophes ('). Either one works, but try to use them consistently. Later on, when we’re writing other kinds of code that reads and extracts from your XML, you’ll find that you need to work with single quotes to refer to attribute values (more on that later). For now, as you write XML, double quotes are what we commonly use. Note that these are straight quotation marks (") and not the curly ones that you see in a word processor.
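A small sketch of the quoting rule, reusing the two examples above: the first attribute value is in double straight quotes and the second in single straight quotes, and a parser treats both the same way. The shoppingList wrapper is added here only so that the sketch is one complete document.

```xml
<!-- shoppingList is used here as an assumed wrapper element for this sketch. -->
<shoppingList>
  <foreign language="French">escargot</foreign>
  <foreign language='Spanish'>sofrito</foreign>
</shoppingList>
```

Whichever kind of quote opens an attribute value must also close it, so you cannot start with a double quote and finish with a single one.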

Self-closing elements

In special cases, XML elements can actually have no text content at all! These are called self-closing elements, and they have a special syntax so that they open and close inside a single tag, with a forward slash just before the closing angle bracket, as in <lb/>.

Here is one use-case for a self-closing tag: We are using it to contain information about where a line of poetry ends, because our XML markup would not make that clear. The line numbering is not literally in the poem we are coding, but we want to record the information about the line ending in the appropriate place:

            <poem> 
               I think that I shall never see<lb n="1"/>
               A poem lovely as a tree<lb n="2"/>
            </poem>

This shows us a use of markup that does not simply wrap around text, but stores information that will be useful to us later in processing the file. Note that we could also have chosen to code the lines like this:

            <poem> 
               <line n="1">I think that I shall never see</line>
               <line n="2">A poem lovely as a tree</line>
            </poem>

Both ways of coding the lines of poetry are correct, and might be used for different reasons. If we didn’t hold the information about the lines in some way, whether wrapping each line, or using self-closing tags at the end of the line, the computer parser would simply see an uninterrupted single line of text, with no notion of the meaningful nature of line breaks.

<poem>I think that I shall never see A poem lovely as a tree</poem>

Even though you have spaced this out with hard returns in your oXygen XML editor, to a computer parser, the text itself is a single undifferentiated string, because in XML hard returns and extra spaces appear as meaningless space and are not meant to be treated as stable formatting.

Usually we decide to write self-closing tags when we want to note simple pieces of information and where wrapping the text in start and end tags would actually cause a problem in nesting our elements. We will be discussing cases where we might want to use self-closing elements as we proceed in the course. Often they have to do with preserving well-formedness, which we discuss in the next section.
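Here is a sketch of that nesting problem. Suppose we want to mark a group of words that runs across a line ending; clause is a made-up element name for illustration. If we wrapped each line of the poem in its own line element, clause would have to start inside one line and end inside the next, which is overlap. Marking the line endings with self-closing lb elements keeps everything properly nested:

```xml
<!-- clause is a hypothetical element name; lb follows the poem example above. -->
<poem>
   I think that I shall <clause>never see<lb n="1"/>
   A poem</clause> lovely as a tree<lb n="2"/>
</poem>
```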

Well Formed and Well Formedness in XML

XML must be well-formed in order to be parsed by a computer. That means it must follow the syntax rules for writing XML: It must have a single root element, and its elements must be nested, without any overlap. Every start-tag needs a matching end-tag (unless the element is self-closing). Also, where attributes are used, these need to be written according to expected XML syntax (as above). These are necessary for the document to be XML. Well-formed XML is, simply, correct XML.

The following example is NOT well-formed XML. Can you tell why not? (There are multiple reasons!)

<dairy>
<item>milk</item>
<item>yogurt</item>
</dairy>
<snacks>
<item>chips</item
<item>pretzels</item>
</snacks>

This is NOT well-formed XML either. Why not?

<paragraph>He responded emphatically in French: <emph><foreign language="french">oui</emph></foreign>!</paragraph>
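If you want to check your answers, here is one possible repair for each of the two examples above. These are sketches, not the only fixes. The first assumes a made-up wrapper element named groceries to serve as the single root, and it completes the broken end-tag after chips:

```xml
<!-- groceries is a hypothetical root element added for this sketch. -->
<groceries>
  <dairy>
    <item>milk</item>
    <item>yogurt</item>
  </dairy>
  <snacks>
    <item>chips</item>
    <item>pretzels</item>
  </snacks>
</groceries>
```

The second closes the inner foreign element before the outer emph element, so the two no longer overlap:

```xml
<paragraph>He responded emphatically in French: <emph><foreign language="french">oui</foreign></emph>!</paragraph>
```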

Special Reserved Characters in XML

Computers (as well as people) need to be able to read XML and tell tags apart from text, to distinguish elements from their content. So, we run into well-formedness problems when we want to represent certain characters, like left and right angle brackets, AS text. What if you want to write, as I’m doing here, about code and you need to represent tagging AS text? View my page source, look for the example passages, and you’ll see that I’ve used a work-around solution. What we need to do is escape the special characters (or the reserved characters) that indicate to a computer that these are tags. There are three special characters that we need to escape, and we do this by replacing them with character entities, which tell the computer to display these characters as text only. We’ll show you in class how to do this:

&: must be replaced by &amp;
<: must be replaced by &lt;
>: must be replaced by &gt;
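As a small sketch of escaping in practice, suppose we want an element whose text talks about a tag and also uses an ampersand. The note element and its wording are made up for illustration:

```xml
<!-- note is a hypothetical element name. -->
<note>Write &lt;food&gt; for the start-tag &amp; remember the end-tag.</note>
```

A parser reads the content of note as plain text: Write <food> for the start-tag & remember the end-tag. Nothing inside it is treated as markup.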

Validity in XML: Based on Schema Rules

XML is adaptable and flexible for organizing information, because it is up to the person writing it how they want to define their elements, and what they want to call them. When people work in XML in communities, though, they’ll work with specific tagging conventions in order to easily connect and communicate with each other. For several of our projects, we’re working with one of those community languages in XML, called TEI (the Text Encoding Initiative). TEI is both a community and a language within XML with a standard set of rules, called a schema. If you work together with a group on an XML project, one thing you’ll need to do is define your project’s schema (or work within a pre-existing schema like the TEI’s) so that you’re all coding consistently. When a project’s XML is correct according to its defined schema, we say that the XML is valid, and we run what’s called a validity check to determine this, in which we check the XML against the project’s schema file. You’ll be learning a little later how to write your own schemas for XML using a language called Relax NG, but for now, we’d like you to get used to the actual writing of XML code and to learn some key concepts about it: well-formedness, nesting hierarchy, working around special reserved characters, and validity.
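To make the idea of a schema a little more concrete, here is a minimal sketch of one written in Relax NG’s XML syntax (Relax NG also has a compact, non-XML syntax). The rules are invented for illustration and are not the schema for any real project: they say that a shoppingList must contain one or more food elements, each holding only text.

```xml
<!-- A hypothetical schema for a simplified shopping list. -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="shoppingList">
      <oneOrMore>
        <element name="food">
          <text/>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>
```

Under these rules, a shoppingList holding two food elements is valid. A shoppingList that also contains a cloth element is still well-formed, but it is not valid, because this schema does not allow cloth there.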

The XML Comment: Writing Comments on your Code

One last thing: In real life, coders write comments to themselves and each other in a special way that sets those comments apart from the code they are writing. As you write XML to share with others (whether for turning in assignments to the instructors of this course or for sharing code with team members or interested people on the web) you can document the decisions you made by writing little messages designed to be ignored by the computer parsing your document, but meant for human readers. Here is how to write a comment in XML: begin it with <!-- and end it with -->.
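A minimal sketch of a comment sitting inside some XML follows. The wording of the comment and the decision it records are made up for illustration. Everything between the opening <!-- and the closing --> is skipped by the parser:

```xml
<shoppingList>
  <!-- Decision: we tag clothing with cloth, not food. Ask the team if this should change. -->
  <food>eggs</food>
  <cloth>jeans</cloth>
</shoppingList>
```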

The one firm rule for writing comments is that you cannot use a double hyphen (--) inside the comment, because the computer parser will not be able to tell where your comment ends. Angle brackets are technically permitted inside a comment, but leaving them out keeps it easy to see where the comment starts and ends. It is excellent practice in coding to write messages in comments to remind yourself of a decision or alert someone you are working with about a question or a problem, and we strongly encourage such documentation on every piece of code you write.