Tokenizing and grouping rhyme schemes with XSLT functions

There is a project I work for which has encoded rhyme schemes in TEI using the @rhyme attribute on <lg> elements. These contain some complex strings, as they have used parentheses to indicate an internal rhyme and asterisks to indicate that a particular rhyme is a feminine (multi-syllable) rhyme. Rhymes are also marked in the text with <rhyme> elements. So for example you get values that look like:

rhyme="(a*)a*(a*)b(c*)c*(c*)bddee(f)fg(h)hg/"

But I need at any particular point to be able to get four things from this string:

  1. The documented rhyme (from the scheme above) for the current <rhyme> element that I’m processing
  2. Whether the current rhyme is an internal (parentheses) or a feminine (asterisk) rhyme or not.
  3. The set of rhymes for the current line
  4. Whether the current line has any internal (parentheses) or feminine (asterisk) rhymes or not.

The first step is to tokenize the given rhyme scheme. I do this with an XSLT function, and to output the result I could have something like:

 <xsl:variable name="rhyme">
(a*)a*(a*)b(c*)c*(c*)bddee(f)fg(h)hg/
</xsl:variable>
<tokenized-rhymes>
  <xsl:copy-of select="jc:tokenizeRhymes($rhyme)"/>
</tokenized-rhymes>

Here, inside some unseen template, I’ve got a variable with the rhyme scheme in it, and I’m getting a copy-of the output of a function I’ve created called jc:tokenizeRhymes(). This isn’t a very difficult XSLT function; it just consists of an xsl:analyze-string, like so:

<xsl:function name="jc:tokenizeRhymes" as="item()*">
<xsl:param name="rhyme"/>
<xsl:variable name="rhymes">
<list>
    <xsl:analyze-string select="$rhyme" regex="\(*[a-zA-Z]\**\)*">
        <xsl:matching-substring>
            <item>
                <xsl:value-of select="."/>
            </item>
        </xsl:matching-substring>
        <xsl:non-matching-substring/>
    </xsl:analyze-string>
</list>
</xsl:variable>
<xsl:copy-of select="$rhymes"/>
</xsl:function>

This function takes a single parameter ($rhyme) and creates a variable containing a list with a bunch of items inside. To do this it uses a regular expression with xsl:analyze-string, which looks for any opening parentheses \(* (zero or more, so effectively optional), then any letter from a-zA-Z, then any asterisks \**, followed by any closing parentheses \)*. Anything that does not match, such as the trailing slash or surrounding whitespace, is discarded by the empty xsl:non-matching-substring. See, simple. The output from this looks like:


  <list>
         <item>(a*)</item>
         <item>a*</item>
         <item>(a*)</item>
         <item>b</item>
         <item>(c*)</item>
         <item>c*</item>
         <item>(c*)</item>
         <item>b</item>
         <item>d</item>
         <item>d</item>
         <item>e</item>
         <item>e</item>
         <item>(f)</item>
         <item>f</item>
         <item>g</item>
         <item>(h)</item>
         <item>h</item>
         <item>g</item>
      </list>

Getting the current rhyme when I’m processing a <rhyme> element is then fairly easy. I just create a variable $rhymePosition (the position of the rhyme I’m on within the scheme) and then call another function, jc:getCurrentRhyme, with that and the $rhyme variable.

<xsl:variable name="currentRhyme">
  <xsl:value-of select="jc:getCurrentRhyme($rhyme, $rhymePosition)"/>
</xsl:variable>

The jc:getCurrentRhyme function is fairly straightforward as well. It looks like:

<xsl:function name="jc:getCurrentRhyme" as="item()*">
   <xsl:param name="rhyme"/>
   <xsl:param name="currentRhyme" as="xs:integer"/>
   <xsl:variable name="rhymes" select="jc:tokenizeRhymes($rhyme)"/>
   <xsl:copy-of select="$rhymes/list/item[$currentRhyme]"/>
</xsl:function>

It takes two parameters, $rhyme and $currentRhyme (an integer giving the position of the rhyme we are processing within the <lg>, that is, how many rhymes there have been so far including this one). It then creates a new variable $rhymes which holds the output of jc:tokenizeRhymes above. Getting the current rhyme from the list is then easy because we know its number: we just make a copy of the matching <item> by using xsl:copy-of and filtering with the predicate [$currentRhyme]. (This is why we made sure that this parameter was an integer: a numeric predicate selects by position, whereas a non-numeric value would be tested for its effective boolean value instead.)
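How $rhymePosition gets its value isn’t shown here, so the following is only a sketch of one way it might be done, not necessarily what the project’s stylesheet does. It assumes the documents are in the TEI P5 namespace and that every <rhyme> sits somewhere inside the <lg> carrying @rhyme; the @n attribute written to the output is purely illustrative. It also counts rhymes in any nested <lg>, which may or may not be what you want.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="xs tei">

  <!-- Assumption: each <rhyme> is a descendant of the <lg> that carries @rhyme -->
  <xsl:template match="tei:rhyme">
    <!-- Nearest enclosing <lg> with a rhyme scheme -->
    <xsl:variable name="lg" select="ancestor::tei:lg[@rhyme][1]"/>
    <!-- Number of <rhyme> elements in that <lg> which come before this one
         in document order, plus one for this one -->
    <xsl:variable name="rhymePosition" as="xs:integer"
        select="count($lg//tei:rhyme[. &lt;&lt; current()]) + 1"/>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Illustrative only: record the computed position on the output -->
      <xsl:attribute name="n" select="$rhymePosition"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Identity template for everything else -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because count() returns an xs:integer, the resulting $rhymePosition can be passed straight to jc:getCurrentRhyme without any casting.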

Checking whether this is an internal or a feminine rhyme is now very straightforward: we just test the $currentRhyme we’ve created above to see whether contains($currentRhyme, ')') or contains($currentRhyme, '*') is true.
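If you find yourself doing those tests a lot, they could be wrapped up as small functions. This is just a minimal sketch: jc:isInternal and jc:isFeminine are hypothetical names that are not in the stylesheet described here, and the namespace URI bound to the jc prefix is a placeholder.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:jc="http://example.org/ns/jc"
    exclude-result-prefixes="xs jc">

  <!-- Hypothetical helper: true if the token contains a parenthesis, e.g. (a*) -->
  <xsl:function name="jc:isInternal" as="xs:boolean">
    <xsl:param name="token"/>
    <xsl:sequence select="contains(string($token), '(')"/>
  </xsl:function>

  <!-- Hypothetical helper: true if the token contains an asterisk, e.g. a* -->
  <xsl:function name="jc:isFeminine" as="xs:boolean">
    <xsl:param name="token"/>
    <xsl:sequence select="contains(string($token), '*')"/>
  </xsl:function>

  <!-- Demo entry point: run the stylesheet with "main" as the initial template -->
  <xsl:template name="main">
    <checks>
      <xsl:for-each select="('(a*)', 'a*', '(f)', 'b')">
        <check token="{.}"
               internal="{jc:isInternal(.)}"
               feminine="{jc:isFeminine(.)}"/>
      </xsl:for-each>
    </checks>
  </xsl:template>
</xsl:stylesheet>
```

The same helpers also work on a whole <item> element, since string() takes the string value of the element and its descendants.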

In order to get all the rhymes for a line, we need to re-process this tokenized list somewhat. We want to group those items which have parentheses together with the letter which follows them, ending a group at each non-parenthesised letter (optionally followed by an asterisk). It took me a while to get my brain around that, but eventually I came up with:

<xsl:function name="jc:groupRhymes" as="item()*">
<xsl:param name="rhyme"/>
<xsl:variable name="rhymes" select="jc:tokenizeRhymes($rhyme)"/>
<xsl:variable name="groupedRhymes">
  <list>
   <xsl:for-each-group select="$rhymes/list/item"
      group-ending-with="*[matches(., '^[a-zA-Z]\**$')]">
     <item>
      <list>
       <xsl:for-each select="current-group()">
        <item>
         <xsl:value-of select="."/>
        </item>
       </xsl:for-each>
      </list>
     </item>
    </xsl:for-each-group>
  </list>
</xsl:variable>
<xsl:copy-of select="$groupedRhymes"/>
</xsl:function>

This function takes in the parameter $rhyme and tokenizes it using the earlier function, so now we have a list of individual items. It then creates a new list and uses xsl:for-each-group to select all the tokenized items. It creates groups ending with any item whose entire content, from start (^) to end ($), is a single letter optionally followed by an asterisk. This means each group ends with a normal rhyme letter, and any internal rhymes (in parentheses) that precede it are included in that group. For each group it puts out a new item with a nested list and makes each rhyme in that line an item in that nested list. This might seem overkill to some, but having the extra nesting, regardless of whether there are 1, 2, or 20 rhymes in the line, just makes things easier. The output from this looks like:

<list>
<item>
    <list>
        <item>(a*)</item>
        <item>a*</item>
    </list>
</item>
<item>
    <list>
        <item>(a*)</item>
        <item>b</item>
    </list>
</item>
<item>
    <list>
        <item>(c*)</item>
        <item>c*</item>
    </list>
</item>
<item>
    <list>
        <item>(c*)</item>
        <item>b</item>
    </list>
</item>
<item>
    <list>
        <item>d</item>
    </list>
</item>
<item>
    <list>
        <item>d</item>
    </list>
</item>
<item>
    <list>
        <item>e</item>
    </list>
</item>
<item>
    <list>
        <item>e</item>
    </list>
</item>
<item>
    <list>
        <item>(f)</item>
        <item>f</item>
    </list>
</item>
<item>
    <list>
        <item>g</item>
    </list>
</item>
<item>
    <list>
        <item>(h)</item>
        <item>h</item>
    </list>
</item>
<item>
    <list>
        <item>g</item>
    </list>
</item>
</list>

Which, admittedly, is fairly verbose. But you can now have a function that just gets the individual line’s items that you are interested in which would look something like:

<xsl:function name="jc:getCurrentLineRhymes" as="item()*">
  <xsl:param name="rhyme"/>
  <xsl:param name="currentLine" as="xs:integer"/>
  <xsl:variable name="rhymes" select="jc:groupRhymes($rhyme)"/>
  <xsl:copy-of select="$rhymes/list/item[$currentLine]"/>
</xsl:function>

When called with something like:

 <xsl:copy-of select="jc:getCurrentLineRhymes($rhyme, 4)"/>

(where 4 would usually be a variable containing the current line number), it produces something like:

<item>
 <list>
  <item>(c*)</item>
  <item>b</item>
 </list>
</item>

A simple string test on this using contains() can again tell you whether the line has any feminine (asterisk) or internal (parentheses) rhymes.

Hurrah! See, that wasn’t that difficult after all. It makes a good example of using XSLT 2.0 functions that call other functions to break the overall task down into manageable, more object-oriented-like pieces which can be re-used for a variety of purposes. (There are a lot of efficiencies which could be implemented here: jc:getCurrentLineRhymes and jc:getCurrentRhyme are almost identical, except that one uses jc:groupRhymes() and the other uses jc:tokenizeRhymes(). This could be one function which tests a parameter to see which is intended.)
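As a rough sketch of that last idea, here is what a single combined function might look like, wrapped in a stylesheet that can be run on its own. The jc:getRhymes name, its $unit parameter, the "main" template and the namespace URI for the jc prefix are all made up for this illustration; the tokenizer and grouper are the functions from above, just written more compactly.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:jc="http://example.org/ns/jc"
    exclude-result-prefixes="xs jc">

  <xsl:output indent="yes"/>

  <!-- Tokenizer from the post: one <item> per rhyme token -->
  <xsl:function name="jc:tokenizeRhymes" as="item()*">
    <xsl:param name="rhyme"/>
    <xsl:variable name="rhymes">
      <list>
        <xsl:analyze-string select="$rhyme" regex="\(*[a-zA-Z]\**\)*">
          <xsl:matching-substring>
            <item><xsl:value-of select="."/></item>
          </xsl:matching-substring>
          <xsl:non-matching-substring/>
        </xsl:analyze-string>
      </list>
    </xsl:variable>
    <xsl:copy-of select="$rhymes"/>
  </xsl:function>

  <!-- Grouper from the post: one <item> per line, each holding a nested <list> -->
  <xsl:function name="jc:groupRhymes" as="item()*">
    <xsl:param name="rhyme"/>
    <xsl:variable name="rhymes" select="jc:tokenizeRhymes($rhyme)"/>
    <xsl:variable name="groupedRhymes">
      <list>
        <xsl:for-each-group select="$rhymes/list/item"
            group-ending-with="*[matches(., '^[a-zA-Z]\**$')]">
          <item>
            <list>
              <xsl:for-each select="current-group()">
                <item><xsl:value-of select="."/></item>
              </xsl:for-each>
            </list>
          </item>
        </xsl:for-each-group>
      </list>
    </xsl:variable>
    <xsl:copy-of select="$groupedRhymes"/>
  </xsl:function>

  <!-- Hypothetical combined getter: $unit is 'line' for grouped rhymes,
       anything else for individual rhyme tokens -->
  <xsl:function name="jc:getRhymes" as="item()*">
    <xsl:param name="rhyme"/>
    <xsl:param name="position" as="xs:integer"/>
    <xsl:param name="unit" as="xs:string"/>
    <xsl:variable name="rhymes"
        select="if ($unit eq 'line')
                then jc:groupRhymes($rhyme)
                else jc:tokenizeRhymes($rhyme)"/>
    <xsl:copy-of select="$rhymes/list/item[$position]"/>
  </xsl:function>

  <!-- Demo entry point: run the stylesheet with "main" as the initial template -->
  <xsl:template name="main">
    <xsl:variable name="scheme" select="'(a*)a*(a*)b(c*)c*(c*)bddee(f)fg(h)hg/'"/>
    <xsl:variable name="line4" select="jc:getRhymes($scheme, 4, 'line')"/>
    <result>
      <rhyme n="3">
        <xsl:value-of select="jc:getRhymes($scheme, 3, 'rhyme')"/>
      </rhyme>
      <line n="4" internal="{contains($line4, '(')}"
                  feminine="{contains($line4, '*')}">
        <xsl:copy-of select="$line4/list/item"/>
      </line>
    </result>
  </xsl:template>
</xsl:stylesheet>
```

With a processor such as Saxon this can be run without a source document by naming the initial template (for example with -it:main). Line 4 should come back containing (c*) and b, the same group as in the output shown earlier. A string parameter is only one way to switch between the two; a boolean, or simply passing in the already tokenized or grouped list, would do just as well.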

The whole XSLT stylesheet is available from https://github.com/jamescummings/conluvies/blob/master/xslt-misc/tokenize-rhyme-test.xsl.
