simple dynamic transformation of xml with htaccess, php, and xslt

I often transform TEI XML to XHTML as part of projects, but in some instances tools like the eXist XML Database, Apache Cocoon, or even AxKit are not an option, because the hosting arrangement makes only a limited number of technologies available.

In most cases these days a Linux-based server will have Apache’s HTTP server installed, hopefully with the rewrite module (mod_rewrite) enabled. In addition most hosting, even shared hosting, has PHP installed with the libxslt-based XSL extension for XSLT processing. Sadly, this only copes with XSLT 1.0, not XSLT 2.0.
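
A quick way to find out whether a given host can do this is to ask PHP itself. The following is only a minimal sketch; the function name xslt_support_report is made up for illustration and is not part of the setup described here.

```php
<?php
// Hypothetical helper: report whether this PHP build can run XSLT.
function xslt_support_report(): array
{
    $loaded = extension_loaded('xsl') && class_exists('XSLTProcessor');
    return [
        'xsl_extension' => $loaded,
        // LIBXSLT_DOTTED_VERSION is only defined when the xsl extension is loaded.
        'libxslt' => defined('LIBXSLT_DOTTED_VERSION') ? LIBXSLT_DOTTED_VERSION : null,
        // libxslt implements XSLT 1.0 only, so that is the most we can expect.
        'xslt_version' => $loaded ? '1.0' : null,
    ];
}

// Print one line per entry, e.g. when dropped into a test page on the host.
foreach (xslt_support_report() as $key => $value) {
    echo $key, ': ', var_export($value, true), "\n";
}
```

If xsl_extension comes back false, the approach below will not work on that host without the extension being enabled.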

However, one way to make use of this is to have one’s .htaccess file rewrite incoming URLs so that they run an xml2html.php conversion script.

Basic preceding stuff:

#Turn on Rewriting
RewriteEngine On
RewriteBase /
# Redirect any svn requests 
RewriteRule ^\.svn/(.*)$ http://subversion.tigris.org [R]
# utf-8 please
AddDefaultCharset UTF-8
# change directory index to index.xml as default
DirectoryIndex index.xml index.php index.html index.shtml
#ErrorDocuments
ErrorDocument 404 /unavailable.html
ErrorDocument 403 /forbidden.html

Here we start by turning the RewriteEngine on and setting the RewriteBase to the root of the domain. I’ve also got a RewriteRule that takes any requests for stuff in subversion directories and redirects them to the subversion site instead. (Though actually I’m thinking of having that just 404 or 403 instead.) After that we set the default character set to UTF-8, change the default directory index file names, and specify some error documents for 404s and 403s. (These are of course actually unavailable.xml and forbidden.xml, and are transformed by the rule further down.)

After this comes the bit where requests for HTML files are rewritten into parameters on a PHP script:

# If I ask for .xhtml then give me xml2html
RewriteRule ^(.*)\.xhtml$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]
# If I have asked for .html then if the .html file exists, then give it.
RewriteRule   ^(.*)\.html$              $1      [C,E=WasHTML:yes]
RewriteCond   %{REQUEST_FILENAME}.html -f
RewriteRule   ^(.*)$ $1.html [L]
# else provide XML dynamically with xml2html.php
RewriteCond   %{ENV:WasHTML}            ^yes$
RewriteCond   %{REQUEST_FILENAME}.xml -f
RewriteRule ^(.*)$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]

The first of these says that when I ask for any URL on the site ending in .xhtml, take the XML file of the same name and transform it using the xml2html.php script and the site.xsl stylesheet, both in the /scripts directory. This is just for me, so that I can force the transformation to run when both foo.xml and foo.html exist in the same directory.

After this the next RewriteRule matches any request on the site that ends in .html and keeps the first part of it (the path and filename without the extension). At the same time it uses ‘C’ to chain this rule with the next one and ‘E’ to set an environment variable ‘WasHTML’ to ‘yes’. Then there is a RewriteCond testing whether this filename with a .html extension exists. If so, the request is rewritten to that filename.html and processing ends. If not, the next conditions test whether the environment variable WasHTML is set to yes (because remember we’ve taken off the extension), and whether the requested filename with a .xml extension exists. If so, it runs the script, giving the filename with .xml as the xml parameter and in this case site.xsl (in the same scripts directory) as the xsl parameter.
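
The order of those decisions can be modelled in a few lines of PHP. This is only a rough sketch of how I read the rules, not something Apache runs: the function resolve_request and its $exists callback are invented for illustration, paths are given relative to the site root, and the appended query string is left out.

```php
<?php
// Hypothetical model of the rewrite rules above. $exists is a callback that
// says whether a path (relative to the site root) is a real file.
function resolve_request(string $path, callable $exists): string
{
    $script = '/scripts/xml2html.php?xml=../%s.xml&xsl=site.xsl';

    // First rule: a .xhtml request always forces the transformation.
    if (preg_match('/^(.*)\.xhtml$/', $path, $m)) {
        return sprintf($script, $m[1]);
    }
    // Remaining rules only apply to requests ending in .html (WasHTML=yes).
    if (preg_match('/^(.*)\.html$/', $path, $m)) {
        $base = $m[1];
        if ($exists($base . '.html')) {
            return $base . '.html';          // a static file wins
        }
        if ($exists($base . '.xml')) {
            return sprintf($script, $base);  // transform on the fly
        }
        // Neither file exists: in this sketch the stripped path is left as-is.
        return $base;
    }
    return $path; // everything else is untouched
}

// Usage with a made-up set of files.
$files  = ['about.html', 'about.xml', 'news.xml'];
$exists = function (string $p) use ($files): bool {
    return in_array($p, $files, true);
};
echo resolve_request('about.html', $exists), "\n";  // about.html
echo resolve_request('about.xhtml', $exists), "\n"; // /scripts/xml2html.php?xml=../about.xml&xsl=site.xsl
echo resolve_request('news.html', $exists), "\n";   // /scripts/xml2html.php?xml=../news.xml&xsl=site.xsl
```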

That .htaccess file as a whole looks like:

#Turn on Rewriting
RewriteEngine On
RewriteBase /
# Redirect any svn requests 
RewriteRule ^\.svn/(.*)$ http://subversion.tigris.org [R]
# utf-8 please
AddDefaultCharset UTF-8
# change directory index to index.xml as default
DirectoryIndex index.xml index.php index.html index.shtml
#ErrorDocuments
ErrorDocument 404 /unavailable.html
ErrorDocument 403 /forbidden.html
# If I ask for .xhtml then give me xml2html
RewriteRule ^(.*)\.xhtml$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]
# If I have asked for .html then if the .html file exists, then give it.
RewriteRule   ^(.*)\.html$              $1      [C,E=WasHTML:yes]
RewriteCond   %{REQUEST_FILENAME}.html -f
RewriteRule   ^(.*)$ $1.html [L]
# else provide XML dynamically with xml2html.php
RewriteCond   %{ENV:WasHTML}            ^yes$
RewriteCond   %{REQUEST_FILENAME}.xml -f
RewriteRule ^(.*)$ /scripts/xml2html.php?xml=../$1.xml&xsl=site.xsl&%{QUERY_STRING} [L]

The PHP script this is using (which I borrowed from a colleague) uses PHP’s libxslt-based XSL extension (http://www.php.net/manual/en/book.xsl.php) for the XSLT processing. It is fairly short and consists of:

<?php
# Basic check for directory/site traversal
$xml_path = isset($_REQUEST['xml']) ? $_REQUEST['xml'] : '';
$xsl_path = isset($_REQUEST['xsl']) ? $_REQUEST['xsl'] : '';
if ($xml_path === '' || $xsl_path === '') { die("invalid input"); }
if (preg_match('/\.\.\/\.\./', $xml_path)) { die("invalid input"); }
if (preg_match('/http/', $xml_path)) { die("invalid input"); }
if (preg_match('/http/', $xsl_path)) { die("invalid input"); }
if (preg_match('/\.\.\//', $xsl_path)) { die("invalid input"); }
# Load the XSL document into the XSLTProcessor
  $xp = new XSLTProcessor();
  $xsl = new DOMDocument;
  $xsl->load($xsl_path);
  $xp->importStylesheet($xsl);
# Pass the XML path to the stylesheet as the 'xml' parameter
  $xp->setParameter('', 'xml', $xml_path);
# Load the XML document
  $xml_doc = new DOMDocument;
  $xml_doc->load($xml_path);
# Process any XIncludes
  $xml_doc->xinclude();
# Transform the XML with the XSL or put out an error
  $html = $xp->transformToXML($xml_doc);
  if ($html !== false && $html !== null) {
      echo $html;
  } else {
      trigger_error('XSL transformation failed.', E_USER_ERROR);
  }
?>

The first bit of this is just a security precaution against directory (or site) traversal: it rejects any xml parameter containing ‘../..’ or ‘http’, and any xsl parameter containing ‘../’ or ‘http’. I’m sure there are a lot better ways to do this, but just checking the xml and xsl parameters seemed the easiest. I could have made a function and passed each of them to it, or had a single regex look for either of these things, but it all works out the same and doesn’t seem to have much of a speed implication. Then we start a new XSLTProcessor and a new xsl DOMDocument, load in the XSL file given in the xsl parameter, and import it as the stylesheet. We also pass the processor the parameter ‘xml’ so that we can use it in our XSLT if we want. Then we start a new xml_doc DOMDocument, load in the requested XML file, and process any XIncludes in it. Finally we transform the XML document to HTML with transformToXML and echo the result, or trigger an error if the transformation fails.
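
For what it’s worth, the function version of those checks might look something like the following. It is a sketch only: the name is_acceptable_path and its $allow_parent flag are made up here, and it is meant to behave the same as the four preg_match lines rather than to be any stricter.

```php
<?php
// Hypothetical helper combining the script's checks into one function.
// $allow_parent: whether a single '../' is acceptable (true for the xml
// parameter, false for the xsl parameter, as in the script above).
function is_acceptable_path($path, bool $allow_parent): bool
{
    if (!is_string($path) || $path === '') {
        return false;
    }
    // Reject remote URLs (anything containing 'http').
    if (strpos($path, 'http') !== false) {
        return false;
    }
    // xml may climb one level ('../foo.xml') but not two ('../../foo.xml');
    // xsl may not climb at all.
    $forbidden = $allow_parent ? '../..' : '../';
    return strpos($path, $forbidden) === false;
}

// Usage with made-up values.
var_dump(is_acceptable_path('../index.xml', true));               // bool(true)
var_dump(is_acceptable_path('../../etc/passwd', true));           // bool(false)
var_dump(is_acceptable_path('site.xsl', false));                  // bool(true)
var_dump(is_acceptable_path('http://example.org/x.xsl', false));  // bool(false)
```

The script would then call it once for each parameter and die("invalid input") if either call returns false.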

This is a fairly lightweight way to transform XML to HTML on the fly using the technologies (PHP and .htaccess) that most hosting solutions provide. I’m using something like this on one of my personal sites and it is in use in a slightly different form in a number of work sites.

Hope it is useful to someone.
