Issue 16, 2012-02-03

Using XSLT’s SQL Extension with Encyclopedia Virginia

This paper explores how to integrate data across a hybrid relational database and XML-based management system. It examines specifically how XSLT’s SQL extension can be used to communicate information between SQL tables and TEI-conformant XML documents to make data-centric content more manageable and flexible and thereby leverage the strengths of both systems. In what follows, one will learn about some of the methods, benefits, and shortcomings of XSLT’s SQL extension in the context of Encyclopedia Virginia, an open access publication of the Virginia Foundation for the Humanities that utilizes a suite of digital humanities and digital library XML vocabularies such as TEI and METS.

by Matthew Gibson

Background and Context

­­­­­­­Some Questions

So why and when would one want to mix XSLT—arguably one of the most powerful languages for performing XML transformations—with SQL expressions, a language for talking to, managing, and extracting information from a relational database? Having cut my own XML teeth beginning in 1998 marking up and managing the markup of electronic texts at the University of Virginia Library, I embraced a poor assumption that mixing XML and relational databases didn’t make much sense. Why, I thought at the time, would you want to store richly structured information—information that, in some ways, was its own self-described and self-contained relational database—inside of a relational database management system? It just seemed redundant and not really keeping with the spirit of what XML was about: platform independent content that could be read and generally understood by humans and easily manipulated by machines.

My approach and view of XML during this time was less that it was an efficient and consistent way to transport data and more that it provided the rules by which I could describe complexly structured literary texts with accurate structural and semantically rich markup. I was an XML purist: disdainful of textual projects that did not employ XML applications in their production and reluctant to see and imagine the strengths and improved flexibility that mixing XML and XML tools with other environments and systems—such as a Relational Database Management System (RDBMS)—could yield for content and data.

A Hybrid Publication System

In 2005, I started developing and building Encyclopedia Virginia (EV), a digitally-born, open access publication of the Virginia Foundation for the Humanities that explores the history and culture of the state of Virginia. Bringing to this project all of my XML biases and experiences, I decided that all encyclopedia entries—i.e. the textual essays—in EV would conform to the TEI (Text Encoding Initiative) P5 guidelines[1] and that all media—still and moving images, and audio files—would be described using different metadata standards inside of a METS (Metadata Encoding and Transmission Standard) wrapper. Thus the content structure of an entry and that entry’s relationship to other entries and other objects (e.g. media, external URLs, etc.) would be defined and “preserved” in the deeper hierarchies of the XML to allow for the content’s portability to different platforms and contexts.

While much of the semantic and mixed content markup that XML affords might get lost, or might be more difficult to reconstruct from a pure RDBMS environment, there are other data and workflow requirements for Encyclopedia Virginia that make an XML/MySQL hybrid publication system a better solution than what a pure XML database option might provide. The strengths that an RDBMS provides for EV are:

  • version control over every piece of content that goes into the encyclopedia;
  • one-to-many relationship management of, for instance, one author and/or editor to many articles, one chronological event referenced by many articles, and one media object shared by many articles;
  • and, most importantly, more efficient and scalable performance in looking up and retrieving data.

A very generalized view of Encyclopedia Virginia‘s publication workflow for an entry looks something like this:

  1. XSLT transforms word processed and edited entry to TEI P5-compliant XML
  2. PHP shreds TEI into appropriate MySQL tables (e.g. authors, editors, events, and geolocations) but retains the canonical and intact XML document for future editing and versioning

A SQL Challenge for XSLT

Although all of EV‘s content is stored in and delivered from a LAMP (Linux, Apache, MySQL, and PHP) platform, the original TEI documents that are ingested into the MySQL tables are created and normalized with a series of XSL transformations before the PHP/MySQL processing takes place. Over time we have realized that in the TEI there are several structures that are better stored and managed through MySQL and that, instead of describing those in the XML, it is more efficient and consistent to point to unique keys in the database that represent that content.

One challenge, in particular, was how TEI might reference authors. Because EV publishes the affiliation of each author and because we want that information to be consistent across, say, multiple entries that a single author might write, marking up that text across multiple TEI documents can lead to variability. What is more, if a given author contributes a large number of articles and her affiliation needs to change, it is difficult to manage that change over multiple files. In the past, marking up an author across TEI documents might look something like this (Figures 1 & 2):

Figure 1. TEI snippet: “Divorce in Early Virginia Indian Society”


<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="DivorcePC">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Divorce in Early Virginia Indian Society</title>
        <respStmt>
          	<resp>Contributor</resp>
          	<persName>
          		<forename>John</forename>
          		<surname>Doe</surname>
  		<affiliation>professor of history at Collegiate University</affiliation>
</persName>
        </respStmt>
</TEI>

Figure 2. TEI snippet: “Pocahontas”


<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="Pocahontas">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Pocahontas (d. 1617)</title>
        <respStmt>
          	<resp>Contributor</resp>
          	<persName>
          		<forename>John</forename>
          		<surname>Doe</surname>
  		<affiliation>professor emeritus of history at Collegiate University and author of <hi rend="italic">Love and Death in the Virginia Colony</hi> (1990) and <hi rend="italic">Hard Times for Early Virginia Coloniasts</hi> (2005)
</affiliation>
</persName>
        </respStmt>
</TEI>

Between these two entries by John Doe, the author’s affiliation has changed. While one could certainly reflect these changes consistently in the XML with a little searching, cutting and pasting, we decided it would be much easier and more modular to control the author’s relationship to his biography and to the entries he contributed by inputting and storing that information directly in the SQL database. Thus if we had to change any of the fields related to that author, such as his affiliation, we would do it once, and those changes would be reflected in all of the entries that he wrote. Thus each of the entry snippets above would, before being published, look like this (Figure 3):

Figure 3. TEI snippet: “Pocahontas” with pointer to primary index in author table

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="Pocahontas">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Pocahontas (d. 1617)</title>
            <respStmt>
               <resp>Contributor</resp>
               <name key="195"/>
            </respStmt>

where the result in the author table with primary index matching “195” contains the following (Figure 4):

Figure 4. Result from SQL table of author with with a primary index of 195

*************************** 1. row ***************************

contributor_id: 195

salutation: Dr.

first_name: John

last_name: Doe

middle_name: D.

bio: professor emeritus of history at Collegiate University and author

of <em>Love and Death in the Virginia Colony</em> (1990) and <em>Hard

Times for Early Virginia Coloniasts</em> (2005)

With the reference now made consistently across all entries written by John Doe, we can pick and choose which values we want to display and how we want to display them. However, the missing piece here was how we would get that reference to the author into the XML itself. If we still valued keeping the XML intact for future portability, we had to figure out a way to retain the relationship between the TEI file and the author’s key in the database. And the challenge, of course, was to get those keys into the XML in an automated way. Because we use XSLT to transform, test, and normalize all of our XML before we publish it, we looked at Saxon’s XSLT SQL extension as a logical solution to create this relationship.

The Saxon SQL Extension

While Xalan—the Apache project’s XSLT processor—has an SQL extension library, because our production process uses features of XSLT 2.0 we turned to the Saxon processor’s SQL extension library. (NB: while you can run Saxon’s SQL extension with command-line parameters, all work below is done in the environment of oXygen’s XML Editor.)

Prerequisites

To get the Saxon SQL extension up and running and talking to your database you need several things:

  • The Java Database Connectivity (JDBC) driver for whatever database you are running—in our case it is MySQL.
  • The saxon-sql.jar (recent releases of oXygen come prebundled with this)
  • A configuration file that allows oXygen to see the saxon-sql.jar for validation and processing.
  • Appropriate credentials (i.e. username and password) to access the database with which you want to communicate and that your database server is configured to be accessed by remote machines.

XSLT SQL elements

Saxon’s SQL extension defines five new XSL elements for interacting with a database that are all bound to the “sql” namespace: sql:connect, sql:query, sql:insert,

sql:column, and sql:close. I actually don’t recommend using sql:insert and sql:column very much since injecting XML data into what, typically, is the more controlled environment of a relational database introduces too much variability and presents possibilities to really cause some chaos.

The other three sql elements work as such:

<sql:connect> establishes, as you might have guessed, the connection to the database and the attributes that it takes (driver, database, user, and password) provide the information that will validate the connection.

<sql:query> does the real work for us here. It performs the query on the database and writes the results of the query to the result tree. While there are six attributes for sql:query, the main ones we use are:

  • table—which defines the name of the table to be queried. This attribute is mandatory.
  • column—defines the name of the column or columns to be retrieved. Using “*” as the value of the column attribute selects all columns. This attribute is also mandatory.
  • where—defines the conditions to be applied in a given selection. This is optional.

<sql:close> closes the sql connection.

An XSLT Example: authors2sql.xsl

Given these elements—and the snippets from Figures 1 and 2 above as our source tree—let’s look at a sample of XSLT code that places these elements in context:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="xs"
    version="2.0" xmlns:sql="http://saxon.sf.net/sql" extension-element-prefixes="sql">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

        <!-- The following parameters define the driver, the database that the connection will need to open, and the credentials to pass to the database to verify permissions --> 

    <xsl:param name="driver" select="'org.gjt.mm.mysql.Driver'"/>
    <xsl:param name="database" select="'jdbc:mysql://DATABASEURL/DATABASENAME'"/>
    <xsl:param name="user" select="'USERNAME'"/>
    <xsl:param name="password" select="'PASSWORD'"/>

        <!-- The template match selects the nodeset that will need to be analyzed for the author's name -->

<xsl:template match="//tei:titleStmt/tei:respStmt">
        <xsl:message>Connecting to database...</xsl:message>

        <!-- The "connection" variable establishes JDBC connection by selecting as its value the SQL connection to the database. The use of the java namespace types the variable as a java language object in order to validate the connection type. The value of the connection variable is determined by the values of the parameters created above -->

        <xsl:variable name="connection" as="java:java.lang.Object" xmlns:java="http://saxon.sf.net/java-type">
            <sql:connect driver="{$driver}" database="{$database}" user="{$user}" password="{$password}">

	<!-- Used primarily for debugging, if, for whatever reason, the credentials or something incorrect is passed in the connect statement, the process will terminate with the following message -->
                <xsl:fallback>
                    <xsl:message terminate="yes">Connection to MySQL failed.</xsl:message>
                </xsl:fallback>
            </sql:connect>
        </xsl:variable>

        <!-- If the variable is successfully established, the following message will be output -->
                 <xsl:message>Connected...</xsl:message>

        <!-- The following variables select the context values from the source tree to test for values in the database to retrieve the correct key -->
<xsl:variable name="fname" select="normalize-space(tei:persName/tei:forename)"/>
<xsl:variable name="lname" select="normalize-space(tei:persName/tei:surname)"/>

        <!-- the contrib-table and contribid variables below select an SQL query statement as their values. Components of the queries include: the value of the connection variable created above, the name of the table inside of the database upon which you want to perform the query, the conditions you are establishing for your selection, and the columns that you want to retrieve if those conditions are true. -->

<xsl:variable name="dbase_lname">
            <sql:query connection="$connection" table="contributors" where="last_name='{$lname}' and first_name='{$fname}'" column="last_name"/>
</xsl:variable>

<xsl:variable name="dbase_fname">
            <sql:query connection="$connection" table="contributors" where="last_name='{$lname}' and first_name='{$fname}'" column="first_name"/>
</xsl:variable>

<xsl:variable name="contrib-table">
                <sql:query connection="$connection" table="contributors" where="last_name='{$lname}' and first_name='{$fname}'"
                    column="*"/>
</xsl:variable>

<xsl:variable name="contribid">
            <sql:query connection="$connection" table="contributors" where="last_name='{$lname}' and first_name='{$fname}'"
                column="contributor_id"/>
</xsl:variable>

        <!-- the following messages can be used as debugging statements to see what type of data the query has retrieved. If the conditions you establish are true, for example, the value of the $dbase_lname variable below will be a copied XML nodeset from the MySQL database if the $lname and $fname variables are equal to values in the table. The result returned would look like this in the message: <row><col>last_name</col></row> -->

            <xsl:message>First name: <xsl:copy-of select="$fname"/></xsl:message>
            <xsl:message>Last name: <xsl:copy-of select="$lname"/></xsl:message>
            <xsl:message>Table: <xsl:copy-of select="$dbase_lname"/></xsl:message>
            <xsl:message>Contrib ID: <xsl:copy-of select="$contribid"/></xsl:message>

        <!-- the choose sequence below evaluates the tei:respStmt nodeset with the following conditionals: 1) if tei:name exists (i.e. if you have already run this stylesheet on this document and have generated the contributor's id), then you will just want to copy the tei:respStmt source tree into the result tree 2) if, on the other hand, any of the values of contrib-table match the last name and first name values extracted from the XML, then the stylesheet will copy the contributor_id from that row where those conditions are true 3) otherwise, all children of the tei:respStmt are copied into the result tree. -->

<xsl:choose>
 	<xsl:when test="tei:name">
                <xsl:copy-of select="."/>
            </xsl:when>
            <xsl:when test="matches($contrib-table, $lname) and matches($contrib-table, $fname)">
                <respStmt xmlns="http://www.tei-c.org/ns/1.0">
                    <resp xmlns="http://www.tei-c.org/ns/1.0">
                        <xsl:value-of select="tei:resp"/>
                    </resp>
                    <name xmlns="http://www.tei-c.org/ns/1.0">
                        <xsl:attribute name="key">
                            <xsl:copy-of select="$contribid"/>
                        </xsl:attribute>
                    </name>
                </respStmt>
            </xsl:when>
            <xsl:otherwise>
                <respStmt xmlns="http://www.tei-c.org/ns/1.0"><xsl:copy-of select="*"/></respStmt>
            </xsl:otherwise>
        </xsl:choose>

<!-- Finally, before the xsl:template closes, you utilize the sql:close element and identify the value of the connection you want to close (which is the connection variable defined at the beginning of the template). -->

        <sql:close connection="$connection"/>
    </xsl:template>
   </xsl:stylesheet>

Summary

Despite these drawbacks, if a situation arises that requires information interchange between an XML-formatted document and a relational database, XSLT’s SQL extension can bridge these two platforms that describe information and information relationships in very different ways. What is more, when a project’s needs require leveraging the strengths of XML and a relational database—when managing and delivering information necessitates a hybrid solution between complex XML description and structure and a relating data efficiently—XSLT can remain a powerful tool in your suite of options for working with XML content.

[1] The P5 specification was in the process of being created during this period. The official release came about in 2007.

About the Author

Matthew Gibson is Director of Digital Programs at the Virginia Foundation for the Humanities (VFH). He holds a Ph.D. in English from the University of Virginia. Prior to joining the VFH in 2005, Matthew served as Associate Director of the University of Virginia Library’s Electronic Text Center. At the VFH, he oversees Encyclopedia Virginia, a digital publication about Virginia history and culture; provides supervision and support for Documents Compass, a Mellon- and NHPRC-funded initiative to facilitate digitization and interoperability between documentary editions; and guides the short- and long-term planning for the organization’s digital efforts and scholarly communications. Matthew has a multitude of conference presentations and several publications to his credit, mostly in the field of digital library standards. For the past nine years he has taught week-long XML application building workshops as an independent consultant and for the Association of Research Libraries.
URLs: http://virginiahumanities.org and http://encyclopediavirginia.org.

One Response to "Using XSLT’s SQL Extension with Encyclopedia Virginia"

Please leave a response below:

  1. Bernadette,

    I see where the variable ‘contribid’ gets assigned a value in lines 50-53, but I can’t figure out where the variable ‘contrib-table’ gets a value to use in line 68? Is it picked up from the ‘table’ attribute value in the sql:query somehow?

    Otherwise, thanks so much for this! It’s a straight-forward example!

Leave a Reply

ISSN 1940-5758