Grouping by group-adjacent="boolean(self::lb)"

A project I was doing some work for had some input that looked like:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader xmlns:xi="http://www.w3.org/2001/XInclude" type="text">
<fileDesc>
   <titleStmt>
      <title>A sample file</title>
   </titleStmt>
   <publicationStmt>
      <distributor>InfoDev</distributor>
   </publicationStmt>
   <sourceDesc>
      <p>VSARPJ project</p>
   </sourceDesc>
</fileDesc>
<profileDesc>
   <creation>
      <date/>
   </creation>
   <langUsage>
      <language ident="ojp">Old Japanese</language>
   </langUsage>
   <textClass>
      <catRef target="#bussoku"/>
   </textClass>
</profileDesc>
<encodingDesc>
   <samplingDecl>
      <p>This text was transcribed phonemically and edited to parallel the content from the
         corresponding item in the <title>Nihon koten bungaku taikei</title> version of the
            <title>Man'yôshû</title>, <ref>Man'yôshû I</ref>. </p>
   </samplingDecl>
</encodingDesc>
</teiHeader>
<text>
<body xml:id="BS.1">
   <div>
      <ab type="original" xml:lang="ojp"> 美阿止都久留 <lb xml:id="BS.1-orig_1"
            corresp="#BS.1-trans_1"/> 伊志乃比鼻伎波 <lb xml:id="BS.1-orig_2" corresp="#BS.1-trans_2"
         /> 阿米爾伊多利 <lb xml:id="BS.1-orig_3" corresp="#BS.1-trans_3"/> 都知佐閇由須礼 <lb
            xml:id="BS.1-orig_4" corresp="#BS.1-trans_4"/> 知知波波賀多米爾 <lb xml:id="BS.1-orig_5"
            corresp="#BS.1-trans_5"/> 毛呂比止乃多米爾 </ab>
      <ab type="transliteration" xml:lang="ojp-Latn">
         <s>
            <phr>
               <phr>
                  <cl>
                     <phr type="arg">
                        <w>
                           <m type="prefix">
                              <c type="phon">mi</c>
                           </m>
                           <w>
                              <c type="phon">ato</c>
                           </w>
                        </w>
                     </phr>
                     <w type="verb" function="adnconc" ana="#L031144">
                        <c type="phon">tukuru</c>
                     </w>
                  </cl>
                  <w type="verb" function="adnconc" ana="#L031144">
                     <c type="phon">tukuru</c>
                  </w>
                  <lb xml:id="BS.1-trans_1" corresp="#BS.1-orig_1"/>
                  <w>
                     <c type="phon">isi</c>
                  </w>
                  <w type="particle" subtype="case" function="gen" ana="#L000520">
                     <c type="phon">no</c>
                  </w>
               </phr>
               <w>
                  <c type="phon">pibiki</c>
               </w>
               <w type="particle" subtype="top" ana="#L000522">
                  <c type="phon">pa</c>
               </w>
            </phr>
            <lb xml:id="BS.1-trans_2" corresp="#BS.1-orig_2"/>
            <cl>
               <phr>
                  <w>
                     <c type="phon">ame</c>
                  </w>
                  <w type="particle" subtype="case" function="dat" ana="#L000519">
                     <c type="phon">ni</c>
                  </w>
               </phr>
               <w type="verb" function="infinitive" ana="#L030170">
                  <c type="phon">itari</c>
               </w>
            </cl>
            <lb xml:id="BS.1-trans_3" corresp="#BS.1-orig_3"/>
<!-- etc -->
         </s>
      </ab>
   </div>
</body>
</text>
</TEI>

What they wanted as output was a table-layout (icky) that aligned two nested tables of the original and the transliteration like:

<table>
   <tr>
      <td>
         <table>
            <tr>
               <td><span class="origLine">美阿止都久留</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">伊志乃比鼻伎波</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">阿米爾伊多利</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">都知佐閇由須礼</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">知知波波賀多米爾</span></td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="origLine">毛呂比止乃多米爾</span></td>
            </tr>
         </table>
      </td>
      <td>
         <table>
            <tr>
               <td><span class="w">miato</span>
                  <span class="w">tukuru</span>
                  <span class="w">tukuru</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">isi</span>
                  <span class="w">no</span>
                  <span class="w">pibiki</span>
                  <span class="w">pa</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">ame</span>
                  <span class="w">ni</span>
                  <span class="w">itari</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">tuti</span>
                  <span class="w">sape</span>
                  <span class="w">yusure</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">titipapa</span>
                  <span class="w">ga</span>
                  <span class="w">tame</span>
                  <span class="w">ni</span>
               </td>
            </tr>
            <tr>
               <td></td>
            </tr>
            <tr>
               <td><span class="w">moropito</span>
                  <span class="w">no</span>
                  <span class="w">tame</span>
                  <span class="w">ni</span>
               </td>
            </tr>
         </table>
      </td>
      <td>BS.1</td>
   </tr>
</table>

If we ignore the icky aspect of using tables for layout and alignment, the solution still has something interesting to teach. This is, at heart, a grouping problem. The solution I came up with was:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">
    <xsl:template match="TEI">
        <html>
            <head>
                <title>test corpus</title>
            </head>
            <body>
                <xsl:apply-templates/>
            </body>
        </html>
    </xsl:template>
    
    <!-- You can put things you want to do nothing to all in one template -->
    <xsl:template match="teiHeader | note | entry | list"/>
    
    <!-- Or similarly, things whose tags you just want to vanish while keeping their content. w is matched here and also by a more specific template below, hence the explicit low priority. -->
    <xsl:template match=" choice | m |w | s |phr|cl " priority="-1"><xsl:apply-templates/></xsl:template>
    
    <!-- If you are using tables for layout purposes (icky) then you don't need to change lb's to BRs. -->
    <!--
    <xsl:template match="lb">
        <br/>
     </xsl:template>-->
    
    <xsl:template match="body">
        <table>
            <tr>
                <td>
                    <!-- Nesting an icky table for each cell, so you can get one tr per line. -->
                   <table> 
                       <xsl:apply-templates select="descendant::ab[@type='original']"/>
                   </table> </td>
                <td>
                    <!-- Nesting an icky table for each cell, so you can get one tr per line. -->
                    <table><xsl:apply-templates select="descendant::ab[@type='transliteration']"/></table>
                 </td>
                <td>
                    <xsl:value-of select="@xml:id"/>
                </td>
            </tr>
        </table>
    </xsl:template>
    
    <!-- Not really necessary but in case you wanted to be able to do something with the original lines, wrap an element around them. -->
    <xsl:template match="ab[@type='original']//text()"><span class="origLine"><xsl:value-of select="normalize-space(.)"/></span></xsl:template>
    
    <!-- For original things group by any child nodes or text, and create the groups adjacent to whether there is a linebreak or not. -->
    <xsl:template match="ab[@type='original']">
        <xsl:for-each-group select="child::node()| child::text()"  group-adjacent="boolean(self::lb)">
        <tr><td><xsl:apply-templates select="current-group()"/></td></tr>
        </xsl:for-each-group>
        </xsl:template>
    
    <!-- For transliterations first flatten hierarchy (you could do this a variety of ways), by copying just the top w elements and linebreaks, and for each of these group adjacent to the line breaks. -->
    <xsl:template match="ab[@type='transliteration']">
        <xsl:variable name="test"><xsl:copy-of select=".//w[not(ancestor::w)] | .//lb"/></xsl:variable>
        <xsl:for-each-group select="$test/*" group-adjacent="boolean(self::lb)">
                    <tr>
                        <td><xsl:apply-templates select="current-group()"/></td>
                    </tr>
            
        </xsl:for-each-group>
    </xsl:template>
    
    <!-- Since we have w's nested inside w's, when we hit one of the top-level ones we wrap an element around it and then take its string value, stripping out any spaces (there are other ways to do this as well). -->
    <xsl:template match="w[not(ancestor::w)]"><span class="w"><xsl:value-of select="translate(normalize-space(.), ' ', '')"/></span><xsl:text> </xsl:text></xsl:template>
</xsl:stylesheet>

Most of this is pretty straightforward, and I’ve included comments in the XSLT to help anyone wondering why I’m doing something. But if we look at just one bit of it:

  <!-- For original things group by any child nodes or text, and create the groups adjacent to whether there is a linebreak or not. -->
    <xsl:template match="ab[@type='original']">
        <xsl:for-each-group select="child::node()| child::text()"  group-adjacent="boolean(self::lb)">
        <tr><td><xsl:apply-templates select="current-group()"/></td></tr>
        </xsl:for-each-group>
     </xsl:template>
 

The interesting part is @group-adjacent="boolean(self::lb)". I'm using the truth or falsity of whether the current node is a line-break element as the key by which adjacent nodes are grouped.

In XSLT 2.0 there are basically two types of grouping conditions: patterns and expressions. @group-starting-with and @group-ending-with require their values to be a pattern, but @group-by and @group-adjacent accept any XPath expression. This means that with those two you can have a bit more fun! With the expression-based attributes, the expression is evaluated for each item in the population you are grouping in order to calculate its grouping key. With the pattern-based attributes, the pattern must match specific nodes in the population, and those nodes start or end a newly created group. This is an important distinction to keep in mind: with group-adjacent you can use an expression that calculates the key to be compared, rather than one that is the key itself.

So in this case we use boolean(self::lb) to test whether the current node is an <lb/> or not. Each <lb/> gets the key true and every other node gets the key false, and each run of adjacent nodes with the same key becomes one group. The text between two line breaks therefore ends up in one group (one row), and each <lb/> forms a group of its own, which is where the empty rows in the output above come from.
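
To make the expression/pattern distinction concrete, here is a minimal sketch that groups the same nodes both ways. It is not part of the project's stylesheet. It assumes a simplified, namespace-free input of my own invention, <ab>one <lb/> two <lb/> three</ab>, and the wrapper element tables and the class values are made up for illustration. The first loop uses an expression and then uses current-grouping-key() to skip the groups made of line breaks; the second uses a pattern, so each lb starts a new group and has to be filtered out of the group's content instead.

```xml
<?xml version="1.0"?>
<!-- Sketch only. Assumed input (hypothetical, no namespace):
     <ab>one <lb/> two <lb/> three</ab> -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="ab">
        <tables>
            <!-- Expression: a boolean key is computed for every child node,
                 and each run of adjacent nodes with the same key is a group. -->
            <table class="adjacent">
                <xsl:for-each-group select="node()" group-adjacent="boolean(self::lb)">
                    <!-- The key is true for groups of lb elements; skip those
                         so that no empty rows are produced. -->
                    <xsl:if test="not(current-grouping-key())">
                        <tr><td><xsl:value-of select="normalize-space(string-join(current-group()/string(.), ''))"/></td></tr>
                    </xsl:if>
                </xsl:for-each-group>
            </table>

            <!-- Pattern: every node matching lb starts a new group, so the lb
                 is the first member of its group and is filtered out here. -->
            <table class="starting-with">
                <xsl:for-each-group select="node()" group-starting-with="lb">
                    <tr><td><xsl:value-of select="normalize-space(string-join(current-group()[not(self::lb)]/string(.), ''))"/></td></tr>
                </xsl:for-each-group>
            </table>
        </tables>
    </xsl:template>
</xsl:stylesheet>
```

With the assumed input, both tables should end up with three rows containing one, two and three. For real TEI input you would also need xpath-default-namespace on the stylesheet, as in the full solution above. Note that the two approaches differ on consecutive line breaks: group-adjacent folds them into a single (skipped) group, while group-starting-with gives each lb its own group and so would emit an empty row.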
