newtFire {dh|ds}
Authored by: Rob Spadafore (spadafour at gmail.com). Edited and maintained by: Elisa E. Beshero-Bondar (ebb8 at pitt.edu). Creative Commons License. Last modified: Friday, 26-Feb-2016 01:29:57 EST.

As we learned in the Relax NG tutorial, we write a schema and associate it with an XML document to constrain that document’s content. This helps when you are working with many complex files or coordinating a team of coders to maintain consistency across an entire project. Relax NG is a grammar-based schema language, which means that it defines the hierarchical relationship of elements and attributes in an entire document, from its root to all its branches. It may seem like Relax NG ought to be able to govern everything we need, but there are certain kinds of constraints that it can’t handle. For these we apply a rule-based schema alongside our grammar-based schema in order to fine-tune precise relationships among elements and attributes. We work with Schematron, a rule-based constraint language that uses XPath expressions to assert or report on the presence or absence of patterns. Rule-based schema languages like Schematron typically do not constrain every element and attribute the way our Relax NG schemas do. Instead, when we write Schematron, we usually concentrate on just a few things that we need to control very precisely, as we will show you here.

Relax NG and Schematron are commonly used together. For example, let’s say we are collecting data from 100 people and want to record their votes for their favorite ice cream flavor: vanilla, chocolate, or strawberry. Limiting our attribute values to those three flavors and defining the vote counts as integers would not be difficult using Relax NG. But what if, instead of 31 votes for chocolate, I accidentally entered 131 votes? A basic Relax NG schema that defines the element vote as vote = element vote {type, xsd:integer}, with type = attribute type {"chocolate" | "vanilla" | "strawberry"}, wouldn’t catch any problems with the specific numbers I enter, because the integer data type cannot be restricted to particular values in relation to a total. If we want to make sure that the numerical values of all <vote> elements add up to 100, Schematron is the tool we need. More generally, we use Schematron when we need to define rules that assert relationships in the content of our elements and attributes: for example, to make sure that a header does not repeat the identical text of a following-sibling header, to check that elements holding page number values appear in the correct order, or to flag every sentence element that is missing a punctuation mark it is supposed to contain.
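
As a rough sketch of the ice cream check described above, a rule along these lines could test the total. It assumes a hypothetical wrapper element named <votes> (not defined in this tutorial) that holds the three <vote> elements, and it belongs inside the <schema> root element introduced in the next section:

<pattern>
	<rule context="votes">
		<!-- sum() adds up the numeric content of every child vote element -->
		<assert test="sum(vote) eq 100">The vote counts must add up to 100.</assert>
	</rule>
</pattern>

With 131 entered for chocolate instead of 31, the total no longer equals 100, so the assertion fails and the message appears.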

Superstructure and namespaces (the stuff at the top of the document)

When you open a new Schematron file in <oXygen/>, you will see the following superstructure:

<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
	xmlns:sqf="http://www.schematron-quickfix.com/validator/process">

</sch:schema>

We will first add some namespace information that dictates how we represent the elements in a) the Schematron document we are writing, and b) the XML document it will constrain, if that XML document is in a special namespace. We typically set the Schematron namespace as the default. (Without this declaration, we would have to type sch:, a namespace prefix, in front of all of our Schematron elements, so we really prefer to use it.) Add the default namespace declaration xmlns="http://purl.oclc.org/dsdl/schematron" to the root element of your new Schematron file and remove the sch: prefix from its start and end tags. The declaration is the last line of the start tag below:

<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
	xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
	xmlns="http://purl.oclc.org/dsdl/schematron">

</schema>

If the XML document(s) you’re trying to constrain are in a specific namespace, such as the TEI, you must identify that namespace with an empty element called <ns/>, and you will also have to use a namespace prefix when representing the XML elements in your schema rules. The next box shows how to define the TEI namespace and its special namespace prefix. If you are writing Schematron to govern TEI XML and you don’t define your namespace, or if you forget to use a prefix to point out the elements that belong to that namespace, the Schematron’s rules simply will not fire when you associate it with your TEI document(!)

<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
	xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
	xmlns="http://purl.oclc.org/dsdl/schematron">
	<ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>

</schema>

About namespaces: Documents are in a namespace or in no namespace, as signaled in their root element. We can see in the code above that a Schematron document has a special xmlns (or XML namespace) attribute that seems to point to a web address. This is not really a website (though sometimes developers put up placeholder websites at namespace URIs): it is simply a uniform resource identifier (that is what URI stands for), a unique string of characters used to identify the Schematron namespace. The TEI has its own namespace URI too, and so do other forms of XML (like XSLT) that we are presenting in this course. If your input document is in the TEI namespace (that is, the root element is <TEI xmlns="http://www.tei-c.org/ns/1.0">), you have to include the <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/> element we illustrated above in your Schematron, and you must use the tei: prefix before all references to elements (but not attributes) from the input TEI document in your Schematron file. That means you need to write //tei:body/tei:div and not just //body/div. Attributes are special because, when they are written without a prefix, they are in no namespace, so they do not take the tei: prefix (and you will not be able to find them if you apply it). So if we are looking for @ref attributes on TEI <div> elements, we would write: //tei:body/tei:div/@ref. You can think of this as a magic incantation that’s needed for Schematron to match just the elements in the TEI document, but if you’d like more explanation of how namespaces work, see http://www.w3schools.com/xml/xml_namespaces.asp.
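
To make the prefix rule concrete, here is a minimal sketch of a complete Schematron file for a TEI document. The rule itself is hypothetical and only for illustration: it assumes a project that wants every <div> directly inside <body> to carry an @type attribute.

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
	<ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
	<pattern>
		<!-- Elements take the tei: prefix; the attribute in the test does not -->
		<rule context="tei:body/tei:div">
			<assert test="@type">Each div inside body must have a type attribute.</assert>
		</rule>
	</pattern>
</schema>

If the tei: prefixes were left off the @context, the rule would match nothing in a TEI document and would never fire.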

The skeleton of a Schematron rule

Pattern, Rule, and Context

Each new schema rule starts with a <pattern> element. Inside the <pattern> is a <rule> element with an @context attribute. It looks like this:

<pattern>
	<rule context=" ">

	</rule>
</pattern>

We can set as many rules as we wish inside a pattern element, which simply works as a convenient organizing structure for putting related rules together. A rule element must have a @context attribute, and its value should be distinct from that of the other rule elements in the same pattern. The value of @context is the specific place in your XML document where the rule applies. (When you have associated your Schematron file with your XML and do validation checking in <oXygen/>, the nodes matched by your @context are where <oXygen/> will mark a validation error triggered by a test of your Schematron rule.)

The form this value takes is called an XPath pattern, and we also use it in XSLT: it is a pattern of elements and/or attributes set in relation to each other that might appear at any level of your document hierarchy. For example, if you write the XPath pattern p/said as the value of @context, the rule will apply to any <said> element that is the child of a <p> element positioned at any level in the XML document hierarchy, whether the parent p element is sitting inside a TEI header in an outer level of the hierarchy or deeply nested inside a note element inside another body p. XPath patterns let us locate particular patterned relationships wherever they sit in the document hierarchy, so they can be a powerful tool to keep our Schematron and XSLT code tidy and efficient. Why is this more efficient? Because we do not have to write the same rule for said elements over and over again for each different XPath position of p, and we may save computer processing time because we do not restart a search from the document node each time, as we would if we began with //p/said. Constructing an XPath pattern like p/said takes advantage of the relational patterns that rule-based schema languages are designed for.

XPath patterns can also use predicates, so that, for example, said[@who] matches any <said> elements that have @who attributes, wherever they are sitting in our XML document.
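
As a small sketch of both kinds of pattern, the rules below use p/said and a predicate as @context values. The tests and messages are hypothetical (they assume a project where every <said> inside a <p> should name its speaker in @who), and the <assert> element they rely on is introduced in the next section:

<pattern>
	<!-- Matches said elements that are children of p, at any depth in the document -->
	<rule context="p/said">
		<assert test="@who">A said element inside p must have a who attribute.</assert>
	</rule>
</pattern>
<pattern>
	<!-- The predicate limits this rule to said elements that already carry @who -->
	<rule context="said[@who]">
		<assert test="string-length(normalize-space(@who)) gt 0">The who attribute must not be empty.</assert>
	</rule>
</pattern>

The two rules sit in separate patterns so that both can be checked against the same <said> element.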

Assert or Report

The <assert> or <report> element is the heart of each Schematron rule. Within each <rule> element we can set one or more <assert> or <report> elements, which contain an attribute called @test. With all of these pieces together, here is the basic skeleton of a Schematron rule using <assert>:

<pattern>
	<rule context=" ">
		<assert test=" "> </assert>
	</rule>
</pattern>

The value of @test is an XPath expression evaluated in immediate relation to the current XPath location of @context, wherever this is. The @test sets a condition that evaluates to true or false: For example, does a particular string pattern exist here? Does the numerical value of this element equal that of its preceding-sibling:: element? Imagine the current context to be shifting with each discovery of the XPath pattern. As the validation checker lands on each new instance, it runs your @test and checks for some condition, true or false, that hinges on that pattern in some way. Basically, @context tells <oXygen/> where to look, and @test tells <oXygen/> what to test when it gets there. You then type a message, your very own customized validation error message, inside the <assert> or <report> element as its text content, and explain (to yourself and/or your project team) the reason the rule is firing. When a rule fires, it will generate an alert message in <oXygen/> just like a message from Relax NG, although in Schematron, it’s your own custom-made message that fires.
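
Here is a sketch of a test written relative to its context, based on the page-order check mentioned in the introduction. It assumes hypothetical sibling <pb> elements that store page numbers in an @n attribute; your own element and attribute names may differ.

<pattern>
	<!-- The predicate skips the first pb, which has no earlier sibling to compare against -->
	<rule context="pb[preceding-sibling::pb]">
		<!-- preceding-sibling::pb[1] is the nearest earlier pb, counted backward from the context -->
		<assert test="number(@n) gt number(preceding-sibling::pb[1]/@n)">Page numbers must increase from one pb element to the next.</assert>
	</rule>
</pattern>

Each time the validator lands on a <pb> that matches the context, the test compares that element’s number with the one just before it.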

Writing the rules

An assert rule

Okay, now that we understand the structure, let’s construct some sample rules so we understand how and why they function. Let’s say you’re keeping track of points in a game where the goal is to get as many points as possible. The person in first place got 23 points, second place got 16, and third place got 12. Let’s construct a basic XML document to store the results:

<gameResults>
	<first>23</first>
	<second>16</second>
	<third>12</third>
</gameResults>

In our very simple example, the first place score should always be more points than the second place score. Let’s write a Schematron rule to make sure the values are entered correctly. First, let’s start by writing the <pattern>, <rule>, and @context. We want the rule to fire (or alert the user) on the <gameResults> element.

<pattern>
	<rule context="gameResults">

	</rule>
</pattern>

Now, we want to write the rule. We want to assert (or say definitively) that the first-place score must always be greater than the second-place score. This means that the rule will fire when the defined assert test fails.

<pattern>
	<rule context="gameResults">
		<assert test="number(first) gt number(second)">The first-place score must be greater than the second-place score.</assert>
	</rule>
</pattern>

When we associate our schema, if we have entered 116 instead of 16 for the second-place score, our schema will fire an error because what we typed fails to fulfill our Schematron assert test. Notice that we need to use the XPath number() function for our rule to treat the contents of the first and second elements as numerical values to be compared. Note: XPath functions that return numerical values are frequently used in Schematron for comparison tests. Some of these functions operate over text content that needs to be converted to numbers, as we did here, and some of them calculate and measure things (like string-length(), which returns a numerical value). Here are the standard ways to indicate comparisons in XPath and Schematron: eq or = (equal to), ne or != (not equal to), gt or > (greater than), ge or >= (greater than or equal to), lt or < (less than), and le or <= (less than or equal to). Because a Schematron file is itself XML, a literal < inside an @test value must be typed as &lt;, which is one reason to prefer the word forms.
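
To watch the assert rule fire, we can deliberately break the sample document. This is only a test sketch that repeats the typo described above:

<gameResults>
	<first>23</first>
	<!-- 116 is the mistyped value: 23 is not greater than 116, so the assert test fails -->
	<second>116</second>
	<third>12</third>
</gameResults>

Validating this version should produce the message written inside the assert element: The first-place score must be greater than the second-place score.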

A note on inconsistency between Relax NG and Schematron: Even if you write a Relax NG schema, as we did for our gameResults.xml file, to define an xsd:integer data type for the contents of the first, second, and third elements, our Schematron still reads the contents of those elements as strings of text until we convert them to numbers in XPath. The Relax NG grammar defines a numerical data type, then, that is nevertheless not read as a number by an XPath parser unless it is prompted to do so. (We provide links to our sample Relax NG and Schematron files so you can test this for yourself.)

A report rule

Now that we have a working schema rule to test the difference between the first- and second-place scores, let’s make a rule that tests the second- and third-place scores. The condition is essentially the same (the second-place score must always be greater than the third-place score), but we’ll use the report element instead to demonstrate how it works. We must add the new test within our existing rule, since it shares the same @context, the gameResults element. Note: If we define a new rule with the same @context as the first inside the same pattern, only the first of the two rules will be applied and the other ignored! So within a given rule @context, we need to define all our assert and report tests together.

When we write a report element, we are asking Schematron to tell us (flag or report) when a particular condition in an @test is met. The difference between assert and report, then, is that an assert test fires an error when its assertion is violated, while a report test fires an error when its condition is met. In this case, we call for a report when the second-place score is less than or equal to the third-place score. Using report in our second test in the example below, the rule will fire when this condition is met.

<pattern>
	<rule context="gameResults">
		<assert test="number(first) gt number(second)">
			The first-place score must be greater than the second-place score.
		</assert>
		<report test="number(second) le number(third)">
			The second-place score must be greater than the third-place score.
		</report>
	</rule>
</pattern>

Here is another way we might write that report statement, to illustrate how we might use the XPath function not() wrapped around a test value:

<report test="not(number(second) gt number(third))">
	The second-place score must be greater than the third-place score.
</report>
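
Putting the pieces from this tutorial together, a complete Schematron file for the game results might look like the sketch below. The root element follows the superstructure shown earlier. The sch: and sqf: namespace declarations are left out for brevity, on the assumption that nothing in this small file uses those prefixes.

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
	<pattern>
		<!-- One rule per context: both tests live inside it -->
		<rule context="gameResults">
			<assert test="number(first) gt number(second)">The first-place score must be greater than the second-place score.</assert>
			<report test="number(second) le number(third)">The second-place score must be greater than the third-place score.</report>
		</rule>
	</pattern>
</schema>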
   	

Associating a Schematron schema with your XML and testing it

Associating a Schematron schema is a lot like associating a Relax NG schema. While viewing your XML document, in the menu bar, click on Document -> Schema -> Associate Schema. From there, locate your schema file (the file extension should be .sch). When you associate a .sch file, <oXygen/> should automatically set the schema type to Schematron. A note on mindful file management: Remember to save your Schematron in a directory where you can easily and consistently locate it. Confirm your selection, and <oXygen/> should insert a line near the top of your XML document that looks like this:

<?xml-model href="your_file_name.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>

If you also have a Relax NG schema associated, you will have two different schema lines at the top of your XML document. The two kinds of schema will function together, so that as you code, violations of either one will appear as validation errors flagged in red in <oXygen/>. The bottom window will feature messages associated with these validation errors, and this will include the messages you write in the text content of your Schematron assert and report elements.
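
For instance, the top of a document governed by both schemas might look like the sketch below. The file names gameResults.rnc and gameResults.sch are assumptions for illustration, and the first xml-model line shows the form commonly used to associate a Relax NG schema in compact syntax:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="gameResults.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="gameResults.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<gameResults>
	<first>23</first>
	<second>16</second>
	<third>12</third>
</gameResults>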

When you associate your schema, always tinker with your XML to create conditions that will cause your Schematron rules to fire! Testing your schema code should be a back-and-forth process to ensure that your assert and report tests are functioning as you want them to.
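
One quick experiment, sketched here with the sample document from this tutorial, is to enter a tie between second and third place so that the report test is met:

<gameResults>
	<first>23</first>
	<!-- second equals third, so number(second) le number(third) is true and the report fires -->
	<second>12</second>
	<third>12</third>
</gameResults>

Restoring 16 for the second-place score should make the message disappear again.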

More information and examples